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Abstract 
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the  distance  to  the  scene  is  stored  for  each  pixel  location  in  the  image.  In  addition, 
uncertainty  about  the  structure  values  is  represented  explicitly  by  the  estimate  co- 
variance.  This  representation  is  maintained  over  time  by  a  stochastic  recursive  esti¬ 
mator,  the  Kalman  filter.  The  estimator  consists  of  two  stages  which  are  repeated 
for  each  new  arriving  image.  The  update  stage  improves  the  current  depth  estimate 
by  incorporating  the  latest  image  measurement.  It  depends  on  the  particular  visual 
mechanism  being  employed  and  amounts  to  an  iterative  relaxation  algorithm  similar 
to  conventional  single-frame  algorithms.  The  prediction  stage  transforms  the  current 
depth  estimate  into  the  next  time-step  to  account  for  changes  in  the  depth  values  that 
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the  sequence.  This  step  requires  a  three-dimensional  transformation  (translation  and 
rotation)  of  each  depth  map  entry  followed  by  a  resampling  operation  to  maintain 
the  regular  map  representation. 

The  temporal  reconstruction  algorithm  is  described  in  detail  for  the  recovery  of 
structure  from  motion  with  and  without  optical  flow  and  for  structure  from  shad¬ 
ing.  Extensive  experimental  evaluation  shows  that  the  temporal  algorithm  not  only 
improves  the  quality  of  estimates  significantly  over  time  but  also  requires  orders  of 
magnitude  less  time  per  image  than  previous  ’’instantaneous”  techniques. 
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Introduction 

Chapter  1 


1.1  Problem  Statement 

How  can  we  perceive  three-dimensional  structure?  Brightness  images  of  three-dimensional 
scenes  contain  a  wealth  of  information  which  humans  can  exploit  through  a  variety 
of  mechanisms  to  extract  information  about  the  structure  of  objects.  Moreover,  this 
cognitive  process  has  a  temporal  dimension:  humans  can  maintain  and  improve  an 
’’idea”  of  a  three-dimensional  structure  as  they  acquire  more  images  of  a  scene  from 
varying  viewpoints. 

The  objective  of  this  thesis  is  to  formalize  the  problem  of  temporal  surface  re¬ 
construction  outlined  above  and  to  investigate  computational  visual  algorithms  for 
its  solution.  Let  us  begin  by  stating  the  problem  more  precisely: 


Temporal  Surface  Reconstruction: 

We  are  given  a  sequence  of  intensity  images  of  a  three-dimensional  scene. 
The  objective  is  to  estimate  the  three-dimensional  structure  of  the  ob¬ 
served  scene. 


The  solution  to  the  above  problem  will  consist  of  answers  to  the  following  ques¬ 
tions: 


•  Representation:  What  are  the  representations  of  three-dimensional  structure 
suitable  for  surface  reconstruction,  considering  in  particular  the  ability  to  main¬ 
tain  the  representation  over  time? 
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•  Visual  Mechanisms:  What  are  the  visual  mechanisms  that  can  be  exploited  to 
recover  information  about  three-dimensional  surfaces  from  brightness  images? 

•  Algorithms:  What  are  the  computational  algorithms  which  are  best  suited  to 
exploit  the  above  visual  mechanisms  and  the  above  structure  representations 
to  obtain  estimates  of  the  three-dimensional  surfaces  which  are  closest  to  the 
true  surfaces  at  the  lowest  computational  cost? 

This  thesis  will  investigate  one  solution  to  the  above  problems  and  contrast  it  with 
other  possible  alternatives. 


1.2  An  Illustrative  Example 

To  gain  some  insight  into  the  difRculties  involved,  let  us  consider  a  straightforward 
solution  to  the  above  problem.  It  is  well-known  that  stereoscopic  vision  is  a  primary 
source  of  three-dimensional  perceptive  capability  in  humans.  Marr  and  Poggio  [61] 
argue  that  humans  match  brightness  ’’edges”  in  the  left  and  right  images.  The 
disparity  between  matching  edge  locations  is  inversely  proportional  to  the  distance 
of  the  corresponding  point  in  the  world.  In  this  case,  the  representation  of  the  three- 
dimensional  structure  would  be  the  distance  of  points  on  the  surface  that  project 
to  edge  locations  in  the  image.  The  visual  mechanism  is  the  inverse  relationship 
between  the  distance  of  surface  points  and  the  disparity  of  matching  edge  locations 
in  the  image.  The  algorithm  (such  as  the  one  by  Crimson  [31])  consists  of  extracting 
edge  locations,  matching  them  in  the  left  and  right  images  and  calculating  the  depth 
from  the  resulting  disparity. 

Neither  this  description  nor  the  original  papers  cited  above  address  the  temporal 
aspect  of  the  problem  i.e.  how  structure  information  can  be  recovered  using  stereo  if 
an  entire  sequence  of  stereo  pairs  is  available.  Of  course,  it  would  be  strughtforward 
to  repeat  the  instantaneous  algorithm  for  every  pair  of  images  in  the  sequence  as  it 
becomes  available.  This  appears  counterintuitive  since  the  calculation  for  a  given  pair 
of  frames  completely  disregards  estimates  from  previous  frames  and  can  therefore  not 
hope  to  provide  the  continuous  estimation  and  estimate  improvement  which  humans 
exhibit.  We  can  formulate  the  difficulties  and  disadvantages  affiliated  with  such  an 
’’instantaneous”  surface  reconstruction  procedure  more  precisely: 

1.  Instantaneous  structure  estimates  are  sensitive  to  measurement  errors  and 
noise.  Combining  estimates  from  a  number  frames  introduces  redundancies 
that  can  be  exploited  to  reduce  the  effect  of  errors.  However,  the  instanta¬ 
neous  approach  cannot  combine  measurements  and  therefore  has  no  temporal 
error-reduction  effect. 
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2.  In  order  to  combine  estimates  from  different  frames,  the  estimates  must  be 
compatible.  However,  the  relative  position  of  camera  and  scene  may  change 
during  the  acquisition  of  frames  and  thereby  cause  instantaneous  structure 
estimates  taken  at  different  positions  to  be  incompatible.  Transformations 
of  structure  estimates  to  account  for  camera  displacement  are  necessary  to 
overcome  isolated  processing  of  images. 

3.  Once  temporal  structure  estimates  are  compatible,  we  need  a  procedure  to 
"combine”  them.  Measurements  taken  at  different  times  may  vary  in  terms 
of  error  and  noise.  In  particular  in  the  case  where  camera  and  scene  are  in 
relative  motion,  the  uncertainty  in  the  structure  estimate  will  vary  spatially 
and  temporally.  The  instantaneous  approach  has  no  way  of  representing  the 
uncertainty  and  no  way  of  using  such  a  representation  to  improve  estimates  of 
high  uncertainty  by  combining  them  with  others  of  lower  uncertainty. 

4.  Typically,  instantaneous  surface  reconstruction  procedures  such  as  the  Marr- 
Poggio/ Crimson  stereo  algorithm  mentioned  above  are  computationally  quite 
expensive.  In  the  instantaneous  scheme  these  expensive  procedures  must  be 
repeated  for  each  frame  and  since  processing  is  done  in  isolation  no  computa* 
tional  benefit  can  be  drawn  from  previous  estimates. 

1.3  The  Temporal  Dimension 

As  we  will  see  in  more  detail  in  chapter  2,  most  work  in  visual  surface  reconstruction 
is  of  the  instantaneous  nature  described  above.  The  emphasis  of  this  thesis  is  on  the 
temporal  aspect  of  surface  reconstruction  and  the  discussion  above  illustrates  some 
of  the  specific  issues  that  must  be  considered.  At  the  same  time,  they  can  serve 
as  the  basis  for  the  set  of  criteria  which  we  may  use  to  judge  the  effectivity  of  a 
temporal  surface  reconstruction  scheme.  Based  on  the  observations  made  above,  the 
following  are  minimal  requirements  for  any  procedure  that  we  may  consider; 

1.  Quality  improvement: 

The  quality  of  estimates  should  improve  by  combining  estimates  over  time. 

2.  Motion  trans formations: 

Estimates  should  be  maintained  in  such  a  way  that  a  relative  motion  of  camera 
and  scene  is  accounted  for. 

3.  Uncertainty  representation: 

Estimates  should  be  maintained  along  with  their  uncertainty  and  the  combi¬ 
nation  of  estimates  should  take  the  uncertainty  into  account. 
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4.  Computational  simplicity: 

Results  obtained  in  previous  time  steps  can  be  used  to  reduce  the  amount  of 
computation  necessary  by  providing  initial  values  for  the  next  step  that  are 
close  to  the  solution. 

In  search  of  a  solution  to  the  temporal  reconstruction  problem  that  satisfies  the 
above  criteria  a  look  at  related  problems  in  other  disciplines  is  enlightening.  Esti¬ 
mation  theory  addresses  the  problem  of  analyzing  a  set  of  measurements  to  estimate 
the  value  of  a  quantity  which  is  related  to  the  measurements  in  a  defined  way.  This 
pertains  to  the  problem  at  hand,  since  the  image  measurements  available  for  surface 
reconstruction  are  related  to  the  quantities  which  describe  the  surface  structure  in 
a  known  and  predetermined  way.  We  have  seen  this  in  the  case  of  stereo  surface 
reconstruction  above  and  have  referred  to  this  relationship  as  the  visual  mechanism. 

A  brief  look  at  the  properties  of  recursive  estimation  methods  provides  insights 
on  a  number  of  the  issues  mentioned  previously. 

•  Measurements  and  estimation  quantities  can  be  modeled  stochastically  by 
probability  distributions  to  describe  the  effect  of  errors  or  measurement  noise. 
Uncertainty  can  be  represented  explicitly  as  the  covariance  matrix  of  these 
probability  distributions  and  can  be  used  to  weight  measurements  of  differing 
quality  appropriately. 

•  Optimal  solutions  (in  terms  of  the  difference  between  estimate  and  true  value) 
have  been  proven  and  are  readily  available. 

•  Recursive  estimation  theory,  in  particular,  addresses  the  problem  of  estimat¬ 
ing  the  internal  state  of  a  dynamical  system  from  external  measurements.  It 
provides  a  solution  to  this  problem  which  incrementally  improves  an  estimate 
of  the  system’s  state  with  every  new  measurement  that  becomes  available. 

These  characteristics  make  it  particularly  interesting  to  investigate  techniques  from 
recursive  estimation  theory  for  the  solution  of  the  temporal  surface  reconstruction 
problem.  As  we  will  see,  casting  surface  reconstruction  in  the  framework  of  estima¬ 
tion  theory  does  not  force  us  to  abandon  the  results  of  instantaneous  methods  but 
rather  provides  a  natural  way  of  embedding  instantaneous  techniques  in  a  temporal 
estimation  scheme  and  explicitly  modeling  uncertainty. 


1.4  Contributions 

The  contributions  of  this  research  work  are  as  follows: 

•  The  problem  of  temporal  surface  reconstruction  is  formulated  and  formalized 
in  the  framework  of  recursive  estimation  theory.  This  formulation  serves  as 
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a  unifying  theory  for  previous  approaches  to  temporal  estimation  and  natu¬ 
rally  subsumes  existing  instantaneous  procedures  by  embedding  them  into  a 
stochastic  framework  and  explicitly  representing  uncertainty. 

•  A  novel  algorithm  for  the  estimation  of  depth  from  motion  image  sequences 
using  optical  flow  is  derived  from  the  temporal  surface  reconstruction  theory, 
has  been  implemented  and  evaluated  experimentally.  Although  this  specific 
problem  has  been  addressed  previously,  the  solution  presented  here  is  not  re¬ 
stricted  to  particular  types  of  motion  and  incorporates  prior  models  of  surface 
structure  (smoothness)  in  a  new  way  based  on  the  stochastic  modeling  from 
estimation  theory. 

•  A  novel  algorithm  for  the  estimation  of  depth  from  shading  information  is 
derived  from  the  temporal  surface  reconstruction  theory,  has  been  implemented 
and  evaluated  experimentally. 

•  A  novel  algorithm  for  the  estimation  of  depth  from  motion  image  sequences 
using  the  ’’direct”  approach  (without  optical  flow)  is  derived  from  the  temporal 
surface  reconstruction  theory,  has  been  implemented  and  evaluated  experimen¬ 
tally. 

•  This  thesis  provides  extensive  experimental  evaluation  of  the  temporal  surface 
reconstruction  theory  on  a  variety  of  real  and  synthetic  images. 

•  A  novel  algorithm  for  the  prediction/motion  warping  of  three-dimensional  sur¬ 
faces  represented  as  depth  maps  to  account  for  the  effect  of  rotational  and 
translational  motion  on  the  relative  position  of  observer  and  surface. 

•  A  detailed  computational  evaluation  of  temporal  surface  reconstruction  in 
terms  of  complexity,  run-time  and  implementation  on  parallel  processors. 

1.5  Structure  of  the  Thesis 

Chapter  2:  The  presentation  in  this  thesis  builds  on  the  work  in  instantaneous 
surface  reconstruction  and  uses  techniques  from  estimation  theory.  We  begin  by 
introducing  some  fundamental  concepts  and  notation  for  imaging.  A  survey  of  pre¬ 
vious  works  in  instantaneous  recovery  of  structure  information  from  a  variety  of 
visual  mechanisms  is  provided.  Finally  a  set  of  detailed  examples  of  instantaneous 
surface  reconstruction  procedures  which  will  later  be  embedded  into  the  temporal 
framework  are  studied  here. 

Chapter  3:  This  chapter  summarizes  previous  work  on  the  temporal  aspect  of 
visual  surface  reconstruction.  It  provides  a  categorization  of  research  according  to 
the  representation  and  the  type  of  algorithm  used. 
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Chapter  4:  Here  we  recapitulate  the  essential  results  from  recursive  estimation 
theory  that  will  be  used  in  this  thesis.  Some  interesting  and  relevant  properties  of 
the  Kalman  filter  are  presented  here. 

Chapter  5:  This  chapter  qualitatively  outlines  the  primary  contribution  of  this 
work:  it  describes  how  recursive  estimation  theory  can  be  applied  to  the  problem 
of  temporal  surface  reconstruction  such  that  the  solution  addresses  the  criteria  set 
forth  previously.  The  resulting  solution  is  an  iterative  algorithm  which  consists  of 
two  stages  for  each  new  image  that  becomes  available:  an  ’’update’'  stage  which 
incorporates  the  new  measurement  into  the  current  structure  estimate  and  a  ’’pre¬ 
diction”  stage  which  accounts  for  relative  motion  between  camera  and  scene  during 
image  acquisition.  Only  the  update  stage  depends  on  the  visual  mechanism  which  is 
used  to  recover  structure  from  image  data. 

Chapter  0:  Here  we  discuss  the  initialization  of  the  temporally  recursive  estima¬ 
tion  algorithm  and  show  how  a  proper  choice  of  initial  values  permits  prior  models  of 
surface  structure  such  as  smoothness  to  be  imposed  on  the  result  of  the  estimation 
procedure. 

Chapter  7:  The  prediction  stage  of  the  temporal  reconstruction  algorithm  is 
independent  of  the  particular  visual  mechanism  and  is  described  in  chapter  7.  The 
procedure  is  equivalent  to  the  motion  warping  of  a  surface  in  three  dimensions  and 
the  subsequent  resampling  of  the  warped  surface  on  a  regular  grid. 

Chapter  8:  This  chapter  describes  the  update  stage  of  the  temporal  estimator 
for  the  case  of  depth  from  motion  using  optical  flow.  It  describes  how  to  choose  the 
matrices  and  vectors  that  determine  the  filter  and  how  the  optical  flow  which  is  used 
as  a  measurement  in  the  formulation  can  be  obtained  from  images. 

Chapter  9:  Here  the  update  stage  of  the  temporal  estimator  for  the  case  of 
depth  from  shading  is  described. 

Chapter  10:  This  chapter  describes  the  update  stage  of  the  temporal  estimator 
for  the  case  of  depth  from  motion  without  the  use  of  optical  flow. 

Each  one  of  the  chapters  8,  9  and  10  shows  the  result  of  extensive  experimentation 
with  the  temporal  surface  reconstruction  algorithms  on  real  images  for  the  particular 
visual  mechanism  . 

Chapter  11:  Here  we  discuss  and  evaluate  the  temporal  reconstruction  algo¬ 
rithm  from  various  points  of  view.  A  detailed  analysis  of  complexity  and  run-time 
is  provided.  We  discuss  possible  implementations  on  parallel  processors  and  special- 
purpose  hardware  and  assess  possible  performance  improvements.  Finally,  a  com¬ 
plete  list  of  assumptions  and  approximations  made  throughout  the  thesis  is  provided 
to  help  identify  weak  points  and  to  serve  as  the  basis  for  comparisons. 

Chapter  12:  This  chapter  contains  summarizing  and  concluding  remarks  as  well 
as  perspectives  on  future  research. 

Three  appendices  at  the  end  of  this  thesis  provide  details  which  are  not  likely  to 
be  of  interest  to  a  casual  reader  but  are  useful  for  the  purpose  of  implementation  for 
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example.  Appendix  A  describes  the  theoretical  derivation  for  the  implicit  Kalman 
filter  introduced  in  chapter  4.  Appendix  B  provides  the  detailed  equations  for  the 
prediction  of  depth  and  covariance  that  are  introduced  in  chapter  7.  Appendix  C 
serves  as  an  implementor’s  handbook.  In  it,  the  the  Kalman  filter  equations  for 
the  filter  update  corresponding  to  the  structure  from  optical  flow  case  of  chapter  8 
are  worked  out  in  such  a  detail  that  they  may  be  immediately  reimplemented  by  the 
reader.  It  also  provides  a  set  of  practical  ideas  which  can  support  the  implementation 
and  make  it  more  flexible. 


Instantaneous  Surface 
Reconstruction 


Chapter  2 


This  chapter  begins  by  introducing  some  basic  notation  used  in  the  recovery  of 
structure  information  from  images.  It  then  provides  a  survey  of  previous  work  on 
the  instantaneous  reconstruction  of  surfaces  from  images.  Finally,  it  gives  three 
detailed  examples  of  visual  surface  reconstruction  procedures  which  operate  in  an 
instantaneous  fashion  in  the  sense  described  in  the  introduction.  The  goal  is  to  later 
(in  chapters  8,  9,  10)  embed  these  procedures  in  the  Kalman  filter  framework  so  that 
they  may  operate  in  a  time-continuous  fashion. 

2.1  Basic  Definitions  and  Notation 

We  begin  with  some  basic  definitions  and  notation  which  is  used  in  all  of  the 
surface  reconstruction  procedures.  Figure  2.1  shows  the  imaging  situation  that  we 
will  consider.  A  viewer- centered  coordinate  system  is  introduced  such  that  the  origin 
lies  at  the  focal  point,  the  x  —  y  plane  is  parallel  to  the  image  plane  and  the  z-axis 
points  outward  along  the  optical  axis.  A  point  P  =  [X,  Y,  Z]  on  the  surface  of  an 
object  in  the  scene  is  projected  to  a  point  P'  =  [x,y,f]  in  the  image  plane  (on  the 
surface  of  the  CCD  image  sensor),  where  the  coordinates  are  related  by  the  equations 
of  perspective  projection* 

®  =  /^  and  y  =  fj  (2.1) 

and  /  is  the  focal  length  of  the  camera. 

The  camera  sensor  provides  measurements  of  brightness  values  on  a  rectangular 
grid  which  we  represent  in  an  array  (Eij)  where  i  and  j  index  the  rows  and  columns 
of  the  image  array  as  shown  in  figure  2.2.  Note  that  the  coordinate  system  is  centered 
on  the  sensor  grid  while  the  indexing  of  the  image  array  begins  at  (0,0)  in  the  upper 

*  Shading  methods  use  orthographic  instead  of  perspective  projection. 
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Figure  2.1:  The  imaging  situation. 


left  corner  in  row  major  order.  The  brightness  array  has  n  rows  and  m  columns.  If 
the  physical  size  of  the  image  sensor  is  w  x  h,  the  spacing  between  sensor  elements 


w  h 

Ax  =  —  and  Ay  =  — 
m  n 


The  relationship  between  physical  coordinates  and  indices  in  the  image  array  is  given 

by 


10  —  1 . 

X  =  {j - ^)Ax 


10  —  1 


y  =  (*  — ~>^y 

y  h  —  1 


The  Z-coordinate  of  a  point  P  is  the  distance  from  the  origin  of  the  coordinate 
system  to  the  perpendicular  projection  of  P  onto  the  optical  axis  and  is  referred  to  as 
the  depth  of  that  point.  The  array  (Zjj)  consisting  of  the  depth  values  corresponding 
to  each  of  the  locations  in  the  image  sensor  grid  is  referred  to  as  a  depth  map. 

For  the  description  of  relative  motion  between  camera  and  scene  we  restrict  our¬ 
selves  to  rigid  body  motions.  Such  motions  can  be  described  by  a  translation  vector 
t  and  a  rotation  matrix  17  both  of  which  will  be  given  relative  to  the  coordinate 
system  of  the  camera  before  the  motion.  Points  Pi,  and  Pk+i  before  and  after  the 
motion  transformation  are  related  by 


Pk+i  =  -t  -  nPk 


(2.5) 
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Figure  2.2:  The  image  sensor  array  and  indexing. 


In  the  case  of  small  motions  between  frames,  the  rotation  can  be  described  by  a 
vector  w: 

Pk+\  -t  -  u>  X  Pfc  (2.6) 

2.2  Instantaneous  Surface  Reconstruction 

Research  in  instantaneous  reconstruction  of  surface  structure  from  images  can  be 
categorized  by  the  visual  mechanism  which  is  used.  A  brief  overview  of  past  work 
using  this  categorization  in  roughly  chronological  order  follows. 


2.2.1  Structure  from  Shading 

When  the  image  brightness  can  be  modeled  as  a  known  function  of  the  surface  struc¬ 
ture,  this  knowledge  can  be  used  to  infer  the  surface  structure  from  an  single  image. 
Horn  [43]  first  formulated  and  proposed  a  solution  to  this  problem  in  his  thesis.  A 
variational  approach  was  presented  in  [50].  Pentland  [77]  introduced  a  greatly  sim- 
plifed  solution  approach  by  assuming  that  surfaces  are  locally  spherical.  Photometric 
Stereo  [49]  is  an  interesting  variant  in  which  several  images  under  different  illumi¬ 
nation  conditions  are  used  to  eliminate  some  of  the  ambiguity  inherent  in  a  single 
image.  For  a  collection  of  essential  papers  and  complete  a  bibliography,  the  reader 
is  referred  to  Horn  and  Brooks  [46]. 
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2.2.2  Structure  from  Texture 

If  the  surface  of  an  object  is  covered  by  a  texture  pattern  in  which  texture  elements 
have  a  constant  or  known  size,  the  relative  size  of  these  patterns  in  the  image  can 
be  used  to  infer  the  objects  shape.  This  problem  was  already  addressed  in  Horn’s 
thesis  [43]  and  received  detailed  attention  later  by  Bajcsy  and  Lieberman  [5]  and  in 
the  theses  of  Stevens  [89]  and  Kender  [56].  Obviously,  this  visual  cue  is  restricted  to 
only  a  specific  class  of  scenes  or  parts  of  scenes. 


2.2.3  Structure  from  Stereo 

The  strongest  visual  mechanism  used  by  humans  to  discern  the  three-dimensional 
structure  is  stereoscopic  vision.  Since  we  have  two  eyes,  the  difference  in  position 
at  which  a  point  in  the  scene  appears  in  the  two  images  can  be  used  to  determine 
its  depth  via  a  simple  triangulation.  Marr  and  Poggio  [60],  [61]  first  studied  the 
human  stereoscopic  vision  system  and  proposed  a  model  for  an  underlying  computa¬ 
tional  mechanism.  Crimson  [31],  [32]  combined  this  theory  with  the  Marr-Hildreth 
approach  to  edge  detection  to  formulate  an  algorithm  that  recovered  structure  infor¬ 
mation  at  the  location  of  edges  in  the  image.  Using  the  variational  approach  later 
referred  to  as  “regularization”,  this  algorithm  recovered  dense  structure  information 
from  two  images. 


2.2.4  Structure  from  Motion 

When  a  camera  moves  relative  to  a  surface,  two  or  more  images  can  be  used  much 
in  the  same  manner  as  in  stereoscopic  vision:  the  difference  in  projected  location 
of  a  scene  point  contains  information  about  the  three-dimensional  location  of  the 
point.  While  the  use  of  small  camera  displacements  can  greatly  reduce  the  matching 
problem  which  plagues  stereo  algorithms,  the  fact  that  the  relative  camera  motion 
is  either  unknown  or  uncertain  introduces  other  complications.  A  first  class  of  al¬ 
gorithms  assumes  that  the  optical  flow,  an  approximation  of  the  projection  of  the 
three-dimensional  velocity  field  has  been  computed  and  can  be  used  to  recover  struc¬ 
ture.  Examples  are  Tsai  and  Huang  [102],  Bruss  and  Horn  [16],  Longuet- Higgins  and 
Prazdny  [59],  Waxman  and  Ullman  [105],  Mitiche  [70],  Spetsakis  and  Aloimonos  [88] 
and  Heeger  and  Jepson[34].  Since  the  computation  of  the  optical  flow  is  expensive, 
recent  proposals  have  sought  to  avoid  this  step  and  directly  extract  structure  infor¬ 
mation  from  image  brightness:  Kanatani  [55]  Negahdaripour,  Weldon  and  Horn  [75], 
[48]  and  Aloimonos  and  Herve  [1]  are  examples  of  this  approach 
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2.2.5  Structure  from  Focus 

For  a  given  focal  length  of  the  imaging  device,  only  certain  scene  points  at  a  specific 
distance  (given  by  the  Gaussian  lens  law)  from  the  camera  will  appear  in  focus.  By 
looking  at  the  frequency  content  of  the  image  regions,  the  focussed  regions  can  be 
identified  and  their  depth  computed  using  the  lens  law.  Krotkov  [58]  analyzed  the 
performance  of  several  criteria  to  identify  image  regions  in  focus  and  showed  how 
it  could  be  used  to  recover  depth.  Pentland  [78]  addresses  the  same  problem  as  do 
Nayar  and  Nakagawa  [73]  for  the  case  of  rough  surfaces.  In  any  case,  the  use  of  this 
method  requires  the  ability  to  change  the  focal  length  of  the  camera  in  a  very  precise 
and  known  manner. 


2.2.6  Other  Related  Work 

Ikeuchi,  Horn  and  Schunck  [50],  [47]  as  well  as  Crimson  [31]  had  realized  the  need 
to  impose  restrictions  on  the  structure  of  the  surfaces  that  they  reconstructed  from 
various  visual  mechanisms.  This  was  done  both  to  reduce  the  effect  of  measure¬ 
ment  noise  and  to  obtain  dense  information  because  reconstruction  was  only  done  at 
sparse  locations.  Terzopoulos  [96],  [97],  contributed  much  to  formalizing  this  process 
by  introducing  the  membrane  and  thin  plate  surface  models  and  analyzing  them 
theoretically  and  experimentally.  Poggio  et  al.  [80]  cast  this  as  the  regularization 
approach  to  both  instantaneous  surface  reconstruction  and  the  solution  of  other  ill- 
posed  problems  in  early  vision.  The  deterministic  regularization  approach  which 
dominates  most  instantaneous  procedures  as  we  will  see  next  was  later  enhanced 
by  probabilisitic  modeling  (Geman  and  Geman  [29],  Marroquin  [63],  Blake  and  Zis- 
serman  [10]  Poggio,  Gamble  and  Little  [79]  and  Geiger  and  Girosi  [26])  which  are 
reviewed  in  some  more  detml  in  section  3.7. 


2.2.7  Depth  from  Motion  using  Optical  Flow 

When  a  scene  moves  relative  to  an  observer  each  scene  point  can  be  assigned  an 
instantaneous  velocity  in  space.  The  projection  of  this  velocity  field  into  the  image 
plane  of  the  observer  is  called  the  motion  field  and  can  be  represented  by  a  vector 
(uij,Vij)  at  every  pixel  location  (t,  j).  A  variety  of  methods  exists  that  estimate  an 
approximation  to  the  motion  field  which  is  typically  referred  to  £ls  the  optical  flow 
from  a  pair  of  images:  Horn  and  Schunck  [47],  Hildreth  [42],  Nagel  and  Enkelmann 
[71],  Heeger  [33],  Anandan  [3].  Two  images  from  a  motion  sequence  along  with  the 
computed  optical  flow  is  shown  in  figure  2.3.  Notice  that  the  optical  flow  estimates 
contain  errors  due  to  a  variety  of  reasons. 

If  an  algorithm  for  the  computation  of  the  optical  flow  is  available  (see  section  8.2 
for  an  example),  we  can  assume  that  the  optical  flow  (u,v)  has  been  computed  for 
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Figure  2.3:  The  first  frame  and  an  optical  flow  field  from  the  pepsi  sequence. 

every  pixel  (i,  j).  The  instantaneous  depth  from  motion  algorithm  uses  the  relation¬ 
ship  between  the  optical  flow  vector  of  a  point  and  the  corresponding  depth  value  Z 
SIS  the  visual  mechanism.  This  relationship  is  given  by  the  Longuet-Higgins/Prazdny 
formulas  [59] 

—  U  4-  x  W 

uij  =  - — ^ - h  Axjpi  -  -H  1)  +  Cyi  (2.7) 

^ij 

-  +  A{y^  +  1)  -  Bxjyi  -  Cxj  (2.8) 

where  t  =  [U,  V,  is  the  relative  translation  between  camera  and  scene,  u;  = 
[A,B,CY'  is  the  relative  rotation  and  (i>,yi)  is  the  physical  image  plane  location  of 
pixel  (i,j). 

We  simplify  these  equations  by  introducing  the  inverse  depth  or  disparity  d  = 
IjZ  (thereby  making  the  problem  linear)  and  abbreviating  the  known  coefficients 
aij  =  —U  XjW  and  0ij  =  —V  +  yjW  as  well  as  the  known  rotational  components 
(which  are  independent  of  depth)  =  Axjyi  —  B{xj  +  1)  +  Cy,  and  v^j  =  A{y^  4- 
1)  —  Bxjyi  —  Cxj.  This  results  in 

Uij  =  Oijdij  4-  u'j  (2.9) 

Vij  =  Bijdij  +  v'j.  (2.10) 

The  surface  reconstruction  objective  is  to  determine  the  disparity  map  (d,-,  )  which 
satisfies  (2.9)  and  (2.10)  when  given  the  optical  flow  field  {uij,Vij).  Mathematically, 
the  problem  is  overdetermined,  as  we  have  2nm  constraints  for  nm  unknowns  so  it 
may  not  be  possible  to  satisfy  all  of  the  constraints.  In  addition,  the  optical  flow  field 
is  known  to  contain  errors  and  to  reduce  their  effect  on  the  reconstructed  surface, 
we  impose  a  smoothness  constraint  on  the  disparity  field  (djj).  This  is  achieved  by 
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constructing  an  energy  function  the  minimum  of  which  is  the  disparity  field  which 
best  reconciles  the  conflicting  objectives 

Jid)=  Z!  (“«i  -  -  ^ij)^  +  (vii  -  ^ijdij  -  +  (2.11) 

»  3 

The  first  two  terms  enforce  the  compatibility  of  d  with  the  measured  optical  flow 
values  according  to  (2.9)  and  (2.10);  the  last  two  terms  penalize  for  large  differences 
in  neighboring  values  of  d  and  thereby  enforce  surface  smoothness. 

The  “optimal”  value  of  is  obtained  by  minimizing  the  functional  J  which 
can  be  done  by  iterative  relaxation  methods  (see  Golub  and  Van  Loan  [30]).  The 
Gauss-Seidel  iteration  equations  in  the  above  case  would  be 

.n+l  _  -  <j)  +  -  t>o  )  + 

where  d,j  =  +  and  n  is  the  iteration  index.  The  above  energy 

functional  method  is  used  widely  in  surface  reconstruction  and  is  also  referred  to  as 
“regularization”  (Terzopoulos  [97],  Poggio  et  al.  [80]).  The  surface  reconstruction 
method  described  here  is  similar  to  the  one  suggested  by  Bruss  and  Horn  [16]  and 
Barron  [7]. 


2.2.8  Depth  from  Motion  without  Optical  Flow 

The  instantaneous  procedure  for  estimating  depth  from  motion  described  in  the  pre¬ 
vious  section  requires  the  computation  of  optical  flow.  Since  this  is  a  computationally 
expensive  procedure,  Horn,  Negahdaripour  and  Weldon  [75],  [48],  [74],  [76]  developed 
a  direct  method  for  the  computation  of  structure  from  image  sequences  which  does 
not  use  the  optical  flow  as  an  intermediate  representation.  It  is  briefly  described 
below. 

The  brightness  constancy  assumption 


(2.13) 


states  that  the  brightness  of  a  fixed  point  in  the  scene  is  unchanged  between  tem¬ 
porally  subsequent  image  frames.  This  assumption  is  not  true  for  most  real  scenes, 
motions  and  lighting  conditions  but  it  is  a  popular  approximation  which  is  also  the 
basis  for  the  estimation  of  optical  flow  fields  (Horn  and  Schunck  [47]).  By  expanding 
the  absolute  derivative  in  (2.13)  we  obtain 
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where  Ex,  Ey,  Et  are  the  brightness  derivatives  in  spatial  and  temporal  directions  and 
(ujv)  is  the  optical  flow  introduced  previously.  Note  that  the  brightness  derivatives 
can  be  approximated  by  taking  flnite  differences  of  image  brightness  values  and  are 
therefore  known  quantities. 

Conceptually,  the  idea  is  as  follows:  The  brightness  change  constraint  equation 

(2.14)  links  brightness  values  to  optical  flow.  The  motion  field  equations  (2.7),  (2.8) 
link  optical  flow  to  rigid  body  motion  and  structure.  By  plugging  (2.7),  (2.8)  into 

(2.14)  we  obtain  one  equation  that  links  image  brightness  values  directly  to  the 
desired  depth  values. 

+  V -w  +  =  0  (2.15) 

Li 

where  t  and  u;  are  the  rigid  body  translation  and  rotation  between  camera  and  scene 
object  and 


■  -fE.  ■ 

Eyf  +  y{xEx  +  yEy)lf 

s  = 

-fEy 

and  V  = 

-Exf  -  x{xEx  +  yEy)lf 

xEx  +  yEy 

yEx  -  xEy 

Note  that  s  and  v  can  be  computed  completely  from  image  brightness  values. 

For  the  purpose  of  surface  reconstruction,  we  transform  the  problem  into  a  linear 
one  by  using  the  disparity  d  =  1/Z  instead  of  the  depth  Z.  Our  objective  is  then 
to  find  a  disparity  surface  which  satisfies  (2.15).  Since  (2.15)  is  expected  to  be 
only  approximately  true  and  since  the  image  brightness  measurements  are  noisy,  we 
impose  the  additional  constraint  that  {dij)  be  smooth  so  as  to  reduce  the  effect  of 
these  errors.  An  energy  functional  is  constructed  which  formalizes  these  objectives 

EE  ((S.,  •  t)i.i  +  .  u.  +  (2.17) 

«  J 

“  <^».>-i)*  +  (di+ij  —  di_i,j)^] 

Again,  the  first  term  enforces  the  compatibility  of  d  with  the  direct  motion  constraint 

(2.15)  and  the  second  term  penalizes  for  large  differences  in  neighboring  values  of  d 
and  thereby  enforces  surface  smoothness. 

The  “optimal”  value  of  dij  is  obtained  by  minimizing  the  functional  J  which  can 
be  done  by  iterative  rel20cation  methods.  The  Gauss-Seidel  iteration  equations  are 

j(n+l)  ~  (S»i  ■  ,o\ 

=  - 4A  ;  (s,,  .  t)> - 


in  which  dij  is  a  local  sum  of  neighbors  of  dij  and  n  is  the  iteration  index. 


2.2:  Instantaneous  Surface  Reconstruction 


31 


2.2.9  Depth  from  Shading 

In  shape  from  shading  (see  Horn  and  Brooks  [46])  the  visual  mechanism  is  a  known 
functional  relationship  between  the  brightness  E  observed  at  a  location  on  the  surface 
and  the  surface  normal  n  =  [— p,  —g,  1]  there 

E  =  R(p,g),  (2.19) 

where  R  is  called  the  reflectance  function  and  p  =  and  g  =  Zy  are  the  partial 
derivatives  of  depth.  Reflectance  functions  have  been  determined  for  a  number  of 
surfaces  such  as  the  lunar  surface  and  Lambertian  surfaces. 

The  objective  is  to  recover  the  surface  structure  Z  from  the  image  brightness 
E  using  the  known  reflectance  properties  (2.19).  As  before  we  introduce  an  energy 
function  on  the  depth  Z 

J(^)=  EE  (2.20) 

•  j 

+  (Zrtw-  - 

which  in  addition  to  enforcing  compatibility  of  Z  with  the  reflectance  function  also 
enforces  smoothness  to  reduce  the  effect  of  errors  in  the  reflectance  model  and  the 
measured  brightness  values.  Note  that  the  partial  derivatives  have  been  approxi¬ 
mated  by  finite  differences  for  the  purpose  of  computation. 

The  iterative  Gauss-Seidel  solution  to  the  above  minimization  problem  is  given 
by 


yn+l 


where 


+  (2.21) 


Rij  =  R{Zij+i  —  Zij^i,  Zi+ij  —  Zi^ij)  (2.22) 

Zi+i,j  —  Zi-ij)  (2.23) 

=  Rg(Zij+i  —  Zij_i,  Zi+ij  —  Zi^ij)  (2.24) 

Zij  =  (Zij+i  -f-  -f-  -I-  Zi^ij)/A  (2.25) 


As  the  reflectance  function  R  is  usually  highly  nonlinear,  the  energy  function  (2.20) 
can  have  many  local  minima  and  the  convergence  of  the  iterative  scheme  is  a  concern; 
it  has  been  addressed  by  alternative  formulations  of  the  energy  function  (see  Horn 
and  Brooks  [46]). 
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2.2.10  Summary 

In  this  chapter  we  have  seen  several  examples  of  how  visual  mechanisms  such  as 
motion  and  shading  (an  example  for  stereo  was  given  in  the  introduction  chapter  1) 
can  be  used  to  reconstruct  the  three-dimensional  structure  of  a  scene  represented  in 
terms  of  a  depth  or  disparity  map.  We  have  seen  that  energy  functionals  not  only 
provide  a  simple  way  to  formalize  the  visual  constraint  on  the  surface  but  also  allow 
to  incorporate  additional  prior  information  such  as  surface  smoothness.  Although 
the  dense  representation  of  a  surface  in  depth/ disparity  maps  is  not  the  only  one 
possible,  it  elegantly  complements  the  energy  functional  method  and  contains  the 
maximum  amount  of  information  which  we  can  hope  to  recover  from  the  images. 


Related  Work 


Chapter  3 


This  section  provides  an  overview  over  previous  work  that  addresses  the  problem 
of  reconstructing  three-dimensional  scenes  from  images  in  a  temporad  framework.  As 
we  have  pointed  out  in  the  introductory  chapter  1,  three  attributes  are  useful  in 
characterizing  the  variety  of  approaches  that  have  been  proposed: 

•  Representations 

•  Visual  Mechanisms 

•  Algorithms 

We  have  already  explored  different  visual  mechanisms  that  can  be  used  for  the  re¬ 
construction  of  three-dimensional  structure  in  chapter  2.  Most  previous  work  in  3D 
reconstruction  is  linked  to  a  particular  visual  mechanism  although  it  has  rarely  ever 
been  investigated,  whether  a  particular  representation  or  algorithm  is  better  suited 
for  reconstruction  using  a  given  visual  mechanism.  We  will  attempt  to  gain  some 
insight  into  the  different  alternatives  that  have  been  proposed  while  realizing  that  it 
may  be  impossible  to  obtain  an  absolute  and  objective  answer  to  questions  of  best 
representation  and  best  algorithm. 


3.1  Representations 

Representations  for  the  three-dimensioned  structure  of  reconstructed  scenes  fall  into 
two  categories 

1.  Dense  representations.  They  usually  involve  some  tesselation  of  the  scene  or 
surface  area  and  structural  information  for  each  tesselation  unit. 

2.  Sparse  representations.  Structural  information  is  only  provided  at  selected 
locations. 
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For  a  more  complete  discussion  of  representation  alternatives  for  three-dimensional 
shapes  see  Faugeras  [22]. 

Dense  representations  such  as  depth  maps,  voxels,  surface  triangulations  etc. 
have  the  advantage  of  representing  details  of  the  surface  structure  and  being  well- 
suited  to  representing  the  dense  information  that  is  available  &om  images.  On  the 
other  hand,  the  computational  burden  is  higher  for  dense  representations.  But  most 
importantly,  while  most  visual  mechanisms  such  as  stereo  and  motion  can  provide 
dense  information  about  surface  structure,  they  provide  useful  information  only  at 
a  subset  of  all  locations  in  the  image  plane  (for  example  near  edges  or  in  areas  of 
significant  texture). 

Sparse  representations  include  feature  points,  line/edge  segments,  polyhedral  ap¬ 
proximations  and  super-quadrics.  In  all  cases,  the  representation  is  restricted  to 
certain  selected  locations  of  the  three-dimensional  scene.  This  has  the  advantage  of 
considerably  reducing  the  computational  complexity.  On  the  other  hand  it  reqtiires 
the  locations  of  the  sparse  structure  representations  to  be  identified  by  some  other 
computational  mechanism  (which  may  be  rather  complex  itself).  In  addition,  the 
sparse  representation  not  contain  sufficient  information  to  perform  any  useful 
tasks  based  on  it  (su<"h  as  navigation  or  recognition). 

In  comparing  these  representation  alternatives,  we  can  think  of  a  sparse  repre¬ 
sentation  as  a  subset  of  a  dense  representation.  For  example  a  feature-point  based 
depth  estimator  will  retrieve  structure  information  at  selected  image  plane  coordi¬ 
nates  while  a  depth  map  based  estimator  will  obtain  information  at  every  image 
plane  location.  Ideally,  the  dense  representation  would  contain  the  same  information 
as  the  sparse  representation  at  the  feature  point  locations.^  This  thought  can  be 
carried  further:  if  we  restrict  ourselves  to  certain  image  locations  because  we  believe 
that  useful  information  can  be  recovered  only  there,  we  should  have  some  notion  of 
what  “useful”  means.  We  are  implying,  that  structure  information  recovered  at  any 
image  location  has  a  certain  “usefulness”  and  in  selecting  a  sparse  representation  we 
restrict  ourselves  to  the  locations  where  this  utility  index  exceeds  a  certain  threshold. 
As  a  consequence,  we  can  reconcile  sparse  and  dense  representations  by  maintaining 
a  measure  of  uncertainty  along  with  each  estimate  in  a  dense  representation.  Then 
the  sparse  representation  can  be  extracted  at  any  time  by  suitably  thresholding  the 
uncertainty  index. 

When  the  temporal  aspect  of  surface  reconstruction  is  considered,  an  additional 
issue  gains  relevance  in  comparing  representations.  In  order  to  combine  structure 
information  across  image  measurements  from  several  frames,  it  must  be  possible  to 
determine  the  correspondence  of  structure  information  between  frames.  For  sparse 
representations,  this  correspondence  must  be  established  explicitly  and  can  involve  a 
rather  complex  matching  procedure.  Dense  representations  have  an  advantage  here, 

^This  is  usually  not  the  case,  since  dense  algorithms  often  involve  some  influence  between  spa¬ 
tially  “neighboring”  pieces  of  structure  information. 
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since  structure  information  is  available  everywhere  and  a  corresponding  surface  point 
in  another  frame  can  be  identified  by  a  simple  geometric  calculation. 


3.2  Algorithms 

The  computational  procedures  that  are  used  to  perform  the  three-dimensional  re¬ 
construction  task  in  a  temporal  fashion  fall  into  one  of  the  folloAving  two  categories: 

1.  Incremental  Algorithms:  With  each  new  frame  of  image  information  that  be¬ 
comes  available  a  current  representation  of  the  three-dimensional  structure  is 
updated. 

2.  Batch  Algorithms:  A  given  number  of  frames  of  image  information  are  accu¬ 
mulated  before  the  processing  begins. 

Incremental  algorithms  have  a  practical  advantage  in  that  they  require  only  the 
latest  frame  of  image  information  to  be  present  in  memory  at  any  given  time.  On  the 
other  hand,  the  incrementally  improved  estimate  after  n  frames  may  not  necessarily 
be  the  best  one  that  could  be  obtained,  if  all  n  frames  were  available. 

Batch  algorithms  attempt  to  achieve  the  maximal  estimate  quality  by  processing 
all  available  frames  together  and  determining  the  values  for  the  structure  parameter 
that  best  fit  the  entire  sequence.  To  do  this,  however,  storage  for  all  of  the  frames 
of  image  information  is  required. 

As  a  consequence,  a  decision  for  a  particular  type  of  algorithm  will  depend  on 
two  factors:  the  type  of  representation  of  three-dimensional  structure  we  are  using 
and  the  application  in  which  the  structure  information  is  to  be  employed.  A  dense 
structural  representation  will  favor  an  incremental  algorithm,  as  it  is  impractical  to 
store  dense  image  information  for  any  sequence  of  significant  length.  For  a  sparse 
set  of  feature  points,  on  the  other  hand,  storage  may  not  be  a  consideration  at 
ail.  Real-time  applications  such  as  navigation  require  structure  information  to  be 
available  and  updated  at  any  time  which  points  towards  incremental  procedures.  On 
the  other  hand,  a  recognition  task  that  operates  on  three-dimensional  information 
may  allow  an  entire  sequence  of  frames  to  be  accumulated  before  processing  begins. 

From  a  practical  point  of  view,  two  additional  considerations  deserve  attention: 
First,  video  image  acquisition  devices  provide  a  continuous  stream  of  images  that 
is  not  limited  in  length  at  the  outset.  This  fact  does  not  rule  out  the  use  of  batch 
methods,  since  we  could  partition  the  stream  into  groups  of  frames  that  a  batch 
method  could  process.  It  does  raise  the  question,  however,  whether  the  results 
obtained  from  one  group  of  frames  can  be  carried  over  to  the  processing  of  the  next 
group:  the  very  same  problem  we  faced  in  extending  two-frame  algorithms  to  image 
sequences.  Second,  batch  procedures  typically  not  only  have  high  storage  demands 
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but  also  involve  rather  complex  numerical  optimization  procedures,  in  particular  for 
dense  representations. 

Ideally,  we  would  like  to  find  an  incremental  method  that  yields  the  same  estimate 
of  three-dimensional  structure  parameters  2ifter  n  frames  as  a  batch  procedure  that 
is  run  on  the  same  n  frames.  One  incremental  procedure,  the  Kalman  filter,  which  is 
also  used  in  this  thesis,  is  known  to  have  this  property  under  certain  circumstances 
which  will  be  explored  in  more  detail  in  chapter  4.  While  these  preconditions  are 
rarely  met  in  practice,  the  recursive  estimator  may  constitute  a  compromise  that 
enjoys  the  computational  benefits  of  an  incremental  algorithm  while  providing  esti¬ 
mates  that  are  comparable  in  quality  to  batch  procedures. 


3.3  Incremental  Algorithms  for  Sparse  Represen¬ 
tations 

The  first  incremental  algorithm  that  used  a  sparse  represenstion  was  Ullman’s  “in¬ 
cremental  rigidity  scheme”  [103],  [104].  This  method  estimates  the  three-dimensional 
coordinates  of  a  set  of  points  that  are  identified  as  features  in  a  sequence  of  monocular 
frames.  The  matching  of  points  across  frames  is  achieved  by  determining  the  trans¬ 
formation  and  the  correspondences  that  minimize  the  amount  of  distortion  (maxi¬ 
mize  the  rigidity)  of  the  rigid  body  on  which  the  feature  points  lie.  The  algorithm 
uses  orthographic  projection  and  will  therefore  fail  for  any  motion  with  components 
along  the  optical  axis.  In  addition,  it  can  be  shown  that  minimizing  model  distortion 
may  not  lead  to  convergence  to  the  true  three-dimensional  shape.  Nevertheless,  this 
work  first  identified  the  importance  of  recovering  structure  information  in  a  tempo¬ 
ral  framework,  it  first  proposed  the  use  of  an  iterative  method  and  it  established 
important  guidelines  for  later  work  in  this  domain. 

Broida  and  Chellappa  [13],  [14]  first  suggested  the  use  of  Kalman  filtering,  an 
incremental  stochastic  estimation  procedure,  to  obtain  feature  point  location  esti¬ 
mates  and  improve  them  over  time.  A  set  of  point  features  are  matched  over  a 
sequence  of  frames  and  estimates  of  their  corresponding  three-dimensional  locations 
are  maintained  with  an  extended  Kalman  filter.  The  filter  formulation  also  includes 
the  camera  motion  parameters  in  the  filter  state.  The  important  relationship  between 
the  incremental  and  the  batch  solution  to  the  estimation  problem  are  addressed  in 
this  work  for  the  first  time.  While  the  early  work  applied  the  recursive  estimator  to 
the  feature  point  estimation  problem  in  a  straightforward  way,  which  led  to  a  highly 
nonlinear  filter,  the  later  work  in  conjunction  with  Chandrashekhar  [12],  [18]  seeks  to 
simplify  both  the  dynamical  model  and  the  measurement  equations.  The  conclusions 
are  very  similar  to  the  ones  in  the  work  of  Faugeras  et  al.  and  to  the  ones  in  this 
thesis;  a  suitable  choice  of  the  state  and  measurement  quantities  can  greatly  reduce 
the  complexity  of  the  estimation  procedure  and  improve  the  quality  of  the  estimates. 
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The  INRIA  vision  group  has  pioneered  the  use  of  the  Kalman  filter  for  the  in¬ 
cremental  estimation  of  parameters  for  three-dimensional  geometric  features  such  as 
lines  and  planes.  Faugeras  et  al.  [23]  used  an  extended  Kalman  filter  to  obtain  an 
estimate  of  the  three-dimensional  location  of  line  segments  observed  in  pairs  of  stereo 
images.  Ayache  and  Faugeras  [4]  performed  the  same  task  using  a  trinocular  stereo 
system.  Faugeras  and  Lustman  [24]  showed  that  finding  corresponding  features  be¬ 
tween  frames  which  is  an  essential  part  of  any  feature-based  temporal  algorithm  is 
considerably  simplified  if  the  scene  can  be  assumed  to  be  piecewise  planar.  More  re¬ 
cently,  Jezouin  and  Ayache  [52]  investigated  the  tradeoffs  of  tracking  and  estimating 
point  and  line  segment  features  in  the  image  plane  versus  in  three  dimensions  using 
an  Extended  Kalman  filter.  It  is  possible  to  trade  off  complexity  of  the  measurement 
model  for  complexity  of  the  matching  procedure.  Navab  et  al.  [72]  explore  how  stereo 
measurements  of  line  segments  can  be  combined  in  a  Kalman  filter  framework  with 
information  obtained  by  tracking  points  on  the  segments  over  time  to  produce  esti¬ 
mates  of  three-dimensional  position.  In  applying  a  Kalman  filter  to  the  estimation  of 
line  segment  features,  two  difficulties  arise:  First,  the  line  segments  must  be  matched 
from  one  frame  to  the  next,  in  order  to  combine  the  information  that  is  contained 
in  these  measurements.  This  is  not  a  simple  tcisk,  as  the  above  work  has  revealed. 
Second,  a  number  of  the  prerequisites  for  the  application  of  the  Kalman  filter  such 
as  a  linear  measurement  model,  non-correlation  between  measurement  and  state  and 
non-correlation  between  measurements  are  often  not  guaranteed  for  the  proposed 
models.  Extensive  experimentation,  however,  seems  to  show  that  these  concerns  can 
be  overcome  in  practice. 

Bharwani,  Riseman  and  Hanson  [8]  use  correspondence  at  the  brightness  level  to 
identify  matching  feature  points  in  monocular  sequences.  The  interesting  feature  of 
this  approach  is  that  the  depth  values  estimated  from  feature  matches  in  all  frames 
up  to  a  given  time  are  used  to  predict  the  location  of  the  feature  in  the  next  frame. 
This  greatly  reduces  the  search  effort  for  the  new  correspondence,  but  it  does  not 
incorporate  the  past  estimates  into  the  new  and  most  current  one;  it  only  uses  them 
as  a  starting  point. 

Dickmans  [20],  [21]  and  Zapp  [111]  proposed  a  Kalman  filtering  based  algorithm 
for  the  analysis  of  monocular  image  sequences  which  he  termed  “4D  dynamic  scene 
interpretation”.  The  outstanding  feature  of  this  approach  is  that  it  was  implemented 
on  specially  designed  hardware  and  successfully  used  for  the  autonomous  guidance 
of  vehicles.  The  algorithm  extracts  the  location  of  the  road  boundary  within  several 
small  rectangular  windows  as  features  and  maintains  an  estimate  of  the  vehicle  po¬ 
sition  relative  to  the  edge  of  the  road  with  the  help  of  a  recursive  estimator.  This 
estimator  feeds  directly  into  a  control  system  for  the  navigation  of  the  vehicle.  Since 
the  algorithm  is  implemented  and  integrated  into  a  functioning  system,  many  of  the 
philosophical  issues  concerning  representations  and  algorithms  are  resolved  against 
the  ultimate  criterion:  it  works. 
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Wu  et  al.  [108]  use  an  extended  Kalman  filter  to  estimate  the  three-dimensional 
location  of  a  set  of  feature  points  as  well  as  the  camera  displacement  between  frames. 
The  particular  regularity  of  the  object  used  in  the  experiments  (a  calibration  grid) 
and  the  restriction  to  small  motions  over  a  short  sequence  greatly  simplifies  the 
necessary  matching  step  which  is  the  key  difRculty  for  sparse  representations  as  we 
have  seen. 

Korsten  [57]  investigates  the  estimation  of  rigid  body  parameters  such  as  plane 
normals  from  image  sequences  using  a  deterministic  least-squares  estimation  pro¬ 
cedure.  This  can  be  considered  as  a  deterministic  version  of  the  Kalman  filtering 
algorithm  used  in  this  thesis.  Since  the  measurement  model  relating  image  bright¬ 
ness  to  the  desired  parameters  is  nonlinear,  a  linearization  about  the  current  estimate 
is  performed  for  each  new  frame  similar  to  the  Extended  Kalman  Filter. 

Sobh  and  Wohn  [86]  presented  an  incremental  algorithm  that  estimates  the  pa¬ 
rameters  of  a  planar  surface  from  optical  flow  fields.  The  temporal  integration  was 
achieved  by  computing  a  weighted  average  between  the  parameters  estimated  from 
the  current  flow  field  and  the  average  of  aU  previous  fields. 

Kalldahl  [53]  uses  a  set  of  features  that  have  been  identified  and  matched  over 
a  sequence  of  monocular  frames.  He  then  uses  two  interleaved  recursive  estimators 
to  update  an  estimate  for  both  the  three-dimensional  locations  of  the  features  and 
the  relative  camera  motion  parameters.  The  interleaved  approach  is  similar  to  the 
one  outlined  previously  in  [36].  Due  to  the  interleaved  scheme,  all  theoretical  results 
concerning  recursive  estimators  are  not  applicable  and  the  convergence  cannot  be 
guaranteed. 

Maybank  [68]  investigated  alternative  incremental  schemes  for  the  estimation  of 
feature  locations  from  a  sequence  of  monocular  images.  He  compared  the  extended 
Kalman  filter  (see  the  work  of  Faugeras  et  al.  and  Broida  and  Chellappa  above)  which 
uses  a  first  order  Taylor  series  approximation  to  the  nonlinear  measurement  equations 
to  a  scheme  in  which  second  order  derivatives  of  the  Taylor  series  are  included. 
For  a  simple  one- dimensional  example  it  was  demonstrated  that  the  second-order 
approximative  filter  produces  superior  estimates. 


3.4  Batch  Algorithms  for  Sparse  Representations 

lu  and  Wohn  [51]  use  the  tracked  locations  of  a  feature  point  in  the  image  throughout 
a  sequence  of  frames.  The  location  of  the  corresponding  three-dimensional  surface 
point  is  modeled  with  a  truncated  Taylor  series,  the  coefficients  of  which  are  esti¬ 
mated  from  the  feature  locations  in  a  least-squares  fashion. 

Spetsakis  and  Aloimonos  [87]  propose  an  interesting  approach  based  on  physical 
intuition.  If  observations  of  a  set  of  feature  points  are  available  throughout  a  sequence 
of  frames,  the  rays  through  these  feature  points  will  usually  not  all  go  through  the 
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same  point  in  three  dimensions  due  to  noise.  We  can  hypothesize  a  set  of  matches 
and  assign  an  error  to  each  match  which  is  proportional  to  the  distance  between 
the  rays  that  should  actually  coincide  in  one  point  (as  if  they  had  springs  attached 
between  them).  An  optimal  set  of  matches  and  consequently  the  three-dimensional 
feature  point  locations  can  be  found  by  minimizing  the  overall  spring  energy.  This 
minimization  involves  the  solution  of  an  eigenvalue  problem  for  a  matrix  with  O(n^) 
elements  where  n  is  the  number  of  fr<imes.  Consequently,  this  can  be  rather  complex 
in  practice  and  experimental  evaluation  with  real  images  is  necessary. 

Shahriat  and  Price  [85]  use  a  single  point  feature  that  has  been  identified  and 
matched  across  a  sequence  of  monocular  frames.  They  show  that  the  translational 
component  of  interframe  motion  can  be  eliminated  if  the  five  frames  are  used.  The 
resulting  equations  can  be  solved  for  the  rotational  motion  component  using  a  non¬ 
linear  least  squares  procedure.  Then  structure  information  can  be  obtained  for  other 
features  on  the  same  rigid  surface.  The  key  question  of  how  the  feature  matching  is 
achieved  across  frames  is  not  addressed  in  this  work.  It  is  also  left  open,  how  more 
than  one  feature  or  the  use  of  more  than  five  frames  could  help  in  reducing  the  effect 
of  errors  in  the  matching  and  thereby  improve  the  quality  of  the  estimates. 

Sawhney  and  Oliensis  [82]  provide  an  interesting  solution  to  the  case  in  which  a 
set  of  feature  points  is  obtained  from  a  monocular  sequence  of  a  rotating  rigid  object. 
Under  these  circumstances,  the  trajectory  of  the  feature  points  in  the  image  plane  is 
shown  to  be  a  conic  section.  A  conic  section  is  fit  to  the  observed  features  in  a  least- 
squares  sense  and  the  corresponding  three-dimensional  conic-section  is  computed. 
This  involves  the  solution  to  an  eigenvalue  problem  for  a  three-dimensional  matrix. 
Although  the  central  problem  of  feature-point  correspondence  is  not  addressed  in 
this  work,  results  from  real  images  and  the  computational  simplicity  indicate  that 
this  algorithm  can  be  very  useful  in  practice. 

Recently,  Tomasi  and  Kanade  [99],  [100]  described  a  mathematically  elegant  and 
computationally  simple  method  for  the  estimation  of  structure  from  a  set  of  matched 
feature  points  in  a  monocular  sequence.  They  showed  that  the  observed  feature  point 
locations  are  the  result  of  multiplying  two  matrices,  one  describing  the  scene  structure 
and  the  other  describing  camera  motion  between  frames.  Both  can  be  obtained  by  a 
singular  value  decomposition.  This  work  is  ongoing  and  current  limitations  such  as 
the  orthographic  projection  model  and  the  batch  nature  of  the  processing  are  being 
addressed. 

3.5  Incremental  Algorithms  for  Dense  Represen¬ 
tations 

Matthies,  Szeliski  and  Kanade  [66],  [67]  independently  developed  a  Kalman  filter 
based  algorithm  for  the  dense  estimation  of  depth  from  a  monocular  motion  sequence 
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that  is  very  similar  to  work  presented  in  this  thesis  (originally  [35]).  Using  the  optical 
flow  as  a  measurement  and  the  inverse  depth  as  the  state,  structure  estimates  of 
unprecedented  quality  were  recovered  from  real  images  for  pure  translation  of  the 
camera  perpendicular  to  the  optical  axis.  The  theoretical  foundation  was  expanded  to 
allow  for  more  general  motions.  This  work  was  instrumental  in  introducing  recursive 
estimation  for  dense  structure  representations  to  the  vision  community.  Matthies 
[64]  later  applied  the  Bayesian  estimation  framework  to  the  dense  estimation  of 
depth  from  sequences  of  stereo  images.  His  thesis  thoroughly  evaluates  the  method 
from  both  a  theoretical  and  an  experimental  point  of  view  which  prompted  me  to 
exclude  stereo  as  a  visual  mechanism  from  my  own  investigations.  One  shortcoming 
of  this  work,  however,  is  the  prediction  stage  of  the  Alter,  which  achieves  temporal 
consistency  of  estimates.  The  Alter  design  requires  the  motion  of  the  stereo  cameras 
to  be  restricted  to  a  translation  along  a  line  perpendicular  to  the  optical  axis. 

Szeliski  [95],  [94]  explor^^d  a  volumetric  dense  representation  for  three-dimensional 
structure.  Octrees  represent  tesselations  of  space  in  which  a  unit  can  either  be 
occupied  or  empty.  Szeliski  showed  how  such  a  representation  can  be  used  in  a 
the  recursive,  Bayesian  estimation  of  structure  from  optical  flow  Aelds  and  from  the 
silhouette  of  a  rotating  object. 


3.6  Batch  Algorithms  for  Dense  Representations 

Tsai  [101]  positioned  a  camera  at  eight  Axed  locations  in  a  plane  and  used  the 
resulting  images  to  calculate  a  dense  depth  map  by  extending  the  stereo  principle 
to  multiple  images.  He  shows  both  theoreticcdly  and  through  simulation  that  the 
quality  of  the  resulting  surface  estimate  is  improved  considerably  as  compared  to  the 
case  in  which  just  two  frames  are  used. 

Holies  and  Baker  [11]  and  later  Yamamoto  [109]  introduced  the  concept  of  the 
epipolar  image.  In  this  approach,  the  monocular  images  acquired  by  a  camera  trans¬ 
lating  perpendicular  to  its  optical  axis  were  collected  into  a  ’’volume”  of  images. 
Slices  through  this  volume  in  the  temporaJ  direction  can  be  analyzed  without  estab¬ 
lishing  correspondences  and  provide  information  not  only  about  rigid  body  motion 
and  structure  but  also  about  occlusion  and  segmentation.  More  recently  [6]  Baker 
and  Holies  have  generalized  this  method  to  handle  more  complex  motions  and  to 
work  incrementally  as  new  images  become  available.  In  this  case,  the  method  be¬ 
comes  similar  to  previously  mentioned  Kalman  Altering  methods. 

Subbarao  [92],  [91]  estimated  surface  structure  from  measurements  of  optical  flow 
over  time  by  interpreting  both  the  depth  and  the  optic2d  flow  values  as  functions  of 
time  and  locally  approximating  these  functions  by  flrst  and  second  order  Taylor  series. 
Under  these  assumptions,  the  structure  parameters  can  be  determined  in  closed  form. 
The  practicality  of  this  approach  is  in  question,  however,  since  it  involves  spatial  and 
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temporal  derivatives  of  the  optical  flow  and  experiments  with  real  images  were  not 
presented. 

Schott  [83]  investigated  how  shading  and  motion  could  be  used  for  the  dense  re¬ 
covery  of  structure  information  in  a  monocular  sequence.  The  central  idea  was  the 
extension  of  the  brightness  constancy  assumption  of  Horn  and  Schunk  [47]  by  a  term 
that  modeled  the  brightness  change  due  to  shading.  A  second  major  contribution  was 
the  formulation  of  a  least-squares  problem  that  used  the  enhanced  motion/shading 
constraint  equation  to  recover  structure  information.  This  involved  the  warping  of 
image  information  across  the  sequence  so  as  to  compensate  for  interframe  motion,  a 
procedure  related  to  the  prediction  stage  of  Kalman  filter  based  algorithms.  To  solve 
the  nonlinear  least-squares  problem  complex  numerical  optimization  procedures  were 
employed.  The  restriction  to  simple  shading  models  reduced  the  practical  applica¬ 
bility  of  the  algorithm. 


3.7  Other  Related  Work 

This  section  summarizes  previous  work  that  does  not  deal  with  the  estimation  of 
structure  information  in  a  temporal  manner  but  contains  representations,  algorithms 
or  experiments  that  put  temporal  surface  reconstruction  into  perspective. 

StuUer  and  Krishnamurthy  [90]  were  among  the  first  to  use  Kalman  filtering 
for  the  estimation  of  visual  information  from  image  sequences.  In  their  case,  the 
optical  flow  wais  the  state  of  a  dynamical  system,  the  measurements  were  the  image 
brightness  values  themselves.  Although  the  Jissumptions  used  in  this  work  were  fairly 
restrictive  such  as  a  locally  planar  approximation  of  the  image  brightness  function,  it 
helped  to  introduce  recursive  estimation  to  the  vision  community.  Rougee  et  al.  [81] 
designed  a  Kalman  filter  that  measured  the  normal  component  of  the  optical  flow 
using  the  brightness  constancy  constraint  and  estimated  the  full  flow  vector.  This 
approach  was  shown  to  significantly  improve  the  results  obtained  by  Hildreth  [42] 
on  only  two  frames.  More  recently.  Black  and  Anandan  [9]  proposed  a  new  energy 
function  based  approach  to  optical  flow  estimation  that  achieved  temporal  coherence 
by  computing  a  weighted  average  between  estimates  in  sequential  frames. 

Sethi  and  Jain  [84]  addressed  the  problem  that  most  of  the  feature-based  ap¬ 
proaches  described  above  seek  to  avoid:  how  to  establish  correspondence  between 
features  over  a  sequence  of  frames.  The  algorithm  is  based  on  the  assumption  that 
trajectories  are  smooth  and  a  utility  function  which  measures  trajectory  coherence 
is  established.  A  greedy  algorithm  is  then  used  to  assign  feature  points  to  trajecto¬ 
ries.  This  algorithm  requires  the  feature  points  to  be  available  for  all  frames  before 
processing  begins  and  can  therefore  be  classified  as  a  batch  approach. 

Franzen  [25]  introduced  the  concept  of  chronogeneous  coordinates  in  which  a 
homogenous  coordinate  representation  of  a  three-dimensional  point  is  extended  to 
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contain  time  as  sin  additional  component.  Although  the  advantage  of  this  represen¬ 
tation  in  accomplishing  the  goal  of  scene  reconstruction  was  not  demonstrated  in  the 
paper,  it  is  an  interesting  framework  that  may  help  put  temporal  estimation  schemes 
in  perspective. 

Thompson  and  Kearney  [98]  suggested  that  geometric  representations  of  three- 
dimensional  structure  may  be  both  difficult  to  compute  and  unnecessary  for  most 
tasks  that  require  vision.  Their  proposal  for  "inexact  vision”  called  for  research  on 
qu2Llitative  representations  of  structure,  specifically  structure  boundaries,  time  to 
collision  and  direction  of  translation.  Weinshall  [106]  recently  presented  results  of 
such  research  in  which  it  was  demonstrated,  that  information  about  the  Gaussian 
surface  curvature  can  be  extracted  in  a  simple  manner  from  stereo  disparities  or 
optical  flow  vectors  and  can  be  used  to  classify  surfaces. 

The  probabilistic  modeling  of  surfaces  for  the  purpose  of  visual  reconstruction 
has  an  extensive  history.  Geman  and  Geman  [29]  first  showed  how  a  constraint 
on  a  surface  modeled  by  an  energy  function  (such  as  surface  smoothness)  can  be 
understood  as  a  probabilistic  Markov  random  field  model  with  the  help  of  Gibb’s 
distribution.  In  the  MRF  model,  probabilities  are  assigned  to  configurations  of  values 
of  the  field  representing  the  depth  map  in  a  limited  neighborhood.  Marroquin  [62] 
explored  how  this  probabilistic  model  could  be  used  for  the  reconstruction  of  surfaces 
from  visual  information  while  imposing  a  smoothness  constraint  on  the  surface  but 
also  allowing  for  discontinuities.  Poggio  et  al.  [79]  showed  how  this  framework  could 
be  used  to  achieve  the  integration  of  information  from  different  visual  mechanisms. 
The  major  disadvantage  of  the  MRF  approach  was  the  enormous  computational 
demand  of  the  simulated  annealing  procedure  required  for  the  computation  of  the 
MAP  surface  estimate.  The  graduated  non-convexity  scheme  of  Blake  and  Zisserman 
[10]  and  the  mean-field  approximation  of  Geiger  and  Girosi  [27]  were  efforts  directed 
at  reducing  this  computational  burden. 

Biilthoff  and  Fahle  [17]  provide  evidence,  that  a  Bayesian  framework  in  which 
measurements  are  used  to  obtain  an  estimate  according  to  the  measurement  uncer¬ 
tainty  is  compatible  with  observations  made  about  the  binocular  perception  of  depth 
in  humans. 

Probabilistic  models  such  as  the  Markov  random  fields  and  the  Kalman  filter  used 
in  this  thesis  are  based  on  the  premise  that  stochastic  modeling  is  possible  and  that 
the  estimate  that  maximizes  the  likelihood  of  coinciding  with  the  true  value  should 
be  chosen.  But  as  we  have  seen,  most  stochastic  models  are  difficult  to  obtain  and 
approximate  at  best.  Dengler  [19]  therefore  proposes  a  new  objective  called  the 
minimum  description  length  criterion.  The  basic  idea  is  that  the  best  representation 
is  the  one  which  is  most  concise.  Promising  experimental  results  are  presented  for 
the  application  of  this  idea  to  the  estimation  of  optical  flow. 


3.8:  Situation  of  this  Thesis 
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3.8  Situation  of  this  Thesis 

This  thesis  presents  an  incremental  algorithm  for  a  dense  structural  representation. 
More  precisely,  a  Kalman  filter  based  algorithm  is  used  to  improve  estimates  of  a 
depth  map  over  a  sequence  of  frames  of  arbitrary  length.  The  motives  for  these 
choices  of  representation  and  algorithm  resulted  from  the  observations  made  in  sec¬ 
tions  3.1  and  3.2  and  the  survey  of  previous  work: 

•  The  dense  representation  was  perceived  as  a  superior  to  a  sparse  represen¬ 
tation  since  the  temporal  correspondence  problem  is  reduced  to  a  geometric 
computation.  In  addition,  when  uncertainty  is  explicitly  modeled,  a  sparse 
representation  can  be  regarded  as  a  subset  of  a  dense  model. 

•  Only  an  incremental  algorithm  is  practiced  for  the  depth  map  representation 
both  under  storage  and  computational  considerations.  In  addition,  the  recur¬ 
sive  estimator  can  be  expected  to  perform  very  close  to  any  batch  procedure. 

The  methodology  and  results  presented  in  this  thesis  are  the  result  of  a  series 
of  research  efforts  by  the  author.  I  have  developed  algorithms  that  use  a  variety  of 
visual  mechanisms  for  the  dense  estimation  of  structure  represented  as  a  depth  map. 
They  reflect  the  stages  in  which  the  results  presented  in  this  thesis  evolved.  In  [35], 
[37],  [36]  a  Kalman  filter  based  algorithm  for  the  dense  estimation  of  structure  from 
monocular  optical  flow  was  described.  The  algorithm  would  also  provide  an  estimate 
of  camera  motion  by  alternately  estimating  structure  with  the  recursive  estimator 
and  motion  with  a  least-squares  method.  The  next  step  [39],  [38]  was  to  apply  the 
recursive  estimation  framework  to  the  reconstruction  of  surfaces  from  monocular 
sequences  without  the  use  of  optical  flow.  This  work  built  on  the  “direct”  method  of 
Negahdaripour,  Weldon  and  Horn  [75],  [48]  and  produced  the  first  structure  estimates 
from  real  images  for  this  method.  Up  to  this  point,  all  of  the  algorithms  were 
designed  to  process  depth  estimates  at  each  pixel  separately  and  neglect  correlations 
for  reasons  of  computational  simplicity.  Influenced  by  the  work  of  Szeliski  [93], 
the  approach  was  reformulated  in  [41]  and  [40]  to  incorporate  prior  models  of  surface 
structure  directly  into  the  filtering  algorithm  and  to  use  visual  cues  other  than  motion 
for  the  reconstruction  process. 

The  tempored  surface  reconstruction  framework  most  notably  distinguishes  itself 
from  other  work  in  the  incremental  reconstruction  of  dense  representations  in  the 
following  points 

•  It  has  been  formulated  and  implemented  for  several  visual  mechanisms,  not 
only  motion.  This  work  shows,  that  any  instantaneous  energy-function  based 
reconstruction  process  can  be  embedded  into  the  temporal  framework  presented 
here. 
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•  Prior  models  of  surface  structure  are  directly  incorporated  into  the  filtering 
process.  Previous  work  had  included  a  ‘‘smoothing”  step  inbetween  the  update 
and  prediction  parts  of  the  filter  with  both  theoretical  and  practical  disadvan¬ 
tages. 

•  The  prediction  stage  of  the  filter  is  entirely  new  and  requires  no  more  restric¬ 
tions  on  the  motion  the  camera  may  undergo  between  frames. 


Recursive  Estimation  Theory 

Chapter  4 


In  this  chapter  I  will  briefly  summarize  the  relevant  details  of  recursive  estimation 
theory  and  how  it  applies  to  problems  in  visual  surface  estimation.  For  details  the 
reader  is  referred  to  Gelb  [28],  Brown  [15],  Willsky  [107]  and  the  dynamical  systems 
literature. 

4.1  Linear  measurement  filter 


System 


Measurement 


Figure  4.1:  A  block  diagram  of  a  linear  dynamical  system. 

In  1960,  Kalman  [54]  formulated  a  solution  to  the  following  estimation  problem: 
We  are  given  a  dynamical  system 

Xfc+i  =  AfcXfc  +  Wfc  (4.1) 

y*  =  CfcXfc  +  Vfc  (4.2) 

where  w*  ~  A^(0,Q^)  and  v*  ~  Ar(0,Rfc).  x  is  referred  to  as  the  state  of  the 
dynamical  system,  y  is  called  the  measurement.  State  and  measurement  noise  must 
be  uncorrelated,  i.e.  E[vfcW^]  =  0.  The  block  diagram  in  flgure  4.1  depicts  the 
functionality  of  the  dynamical  system  model. 
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Informally  stated,  we  have  a  system  that  is  determined  by  a  state  vector  x  that 
changes  over  time  as  described  by  (4.1).  While  x  determines  the  behavior  of  the 
system,  only  the  measurement  vector  y  is  measurable  and  its  relationship  to  the 
state  is  given  by  (4.2). 

The  objective  is  to  estimate  x^  given  the  measurements  yi,  and  the  above  dy¬ 
namics  of  the  system.  In  the  Kalman  filter  the  estimate  St  is  obtained  by  repeating 
the  follovring  two-step  process  at  each  time  k: 

Update: 


Kfc  =  PiCl[C,PiCl  +  R,]-^  (4.3) 

Xfc  =  xj-  +  Kfc(yfc  -  C*x; )  (4.4) 

Pt  =  (I-KfcCfc)P;  (4.5) 

Prediction: 

=  AkxJ  (4.6) 

=  AiP:aJ'  +  Q4  (4.7) 


Figure  4.2:  A  block  diagram  of  the  linear  Kalman  filter. 

Figure  4.2  depicts  the  operation  of  the  Kalman  filter  in  a  block  diagram.  Con¬ 
ceptually,  the  update  stage  incorporates  a  new  measurement  y^  into  the  current 
estimate  of  the  state  x*  by  correcting  for  the  difference  between  actual  measurement 
y^  and  expected  measurement  C^x*  .  The  gain  K*  is  chosen  so  that  the  variance 
of  the  new  estimate  (trace  of  the  covariance  matrix  PjJ’)  is  minimal.  The  prediction 
stage  transforms  state  and  covariance  estimate  using  the  known  system  dynamics. 
The  filter  is  an  optimal  estimator  in  the  sense  that  it  minimizes  the  length  of  the 
error  vector  x*  -  x^  which  is  guaranteed  to  always  decrease. 


4.S:  Nonlinear  measurement  filter 
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4.2  Nonlinear  measurement  filter 

Many  real-world  systems  cannot  be  modeled  by  a  linear  measurement  equation  (4.2). 
A  nonlinear  me2isurement  model  is 


y*  =  f(3C*)  +  Vfc.  (4.8) 

In  this  case  the  update  equations  can  be  reformulated: 

X*  =  Pj-cnc»p;cr + R*]-'  (4.9) 

(4.10) 

Pt  =  (i-KiCOP;  (4.11) 


where  Ci,  =  dijdxh.  This  result  is  obtained  (Gelb  [28])  after  Taylor  series  expansion 
of  f  so  that  Kjb  is  only  an  approximation  to  the  truly  optimal  gain.  It  is  often  referred 
to  as  the  extended  Kalman  filter. 


4.3  Implicit  measurement  filter 

As  we  will  see,  some  estimation  problems  in  vision  cannot  even  be  formulated  in  the 
nonlinear  form  (4.8)  since  it  requires  the  measurement  to  be  an  explicit  function  of 
the  state  vector.  In  these  cases  the  relationship  between  state  and  measurement  is 
implicit 

g(yfc  -  Vfc,Xk)  =  0.  (4.12) 

This  case  is  not  considered  in  the  Kalman  filter  literature  but  in  appendix  A  we  show 
that  using  a  Taylor  series  expansion  as  for  the  nonlinear  filter  above  similar  update 
equations  can  be  derived; 

Kfc  =  P;CnC*PfcC[  +  DfcRfcD[]-‘  (4.13) 

xt  =  Xfc  -f-Kfcg(Xfc,yJ  (4.14) 

Ft  =  (I-KfcCOP^  (4.15) 

where  =  dgfdxk  and  D*  =  dgfdy,,.  This  result  also  follows  from  the  implicit 
function  theorem.  Just  as  in  the  previous  case  the  gain  is  not  truly  optimal  due  to 
the  linearized  approximation. 


4.4  Alternative  formulation  of  the  filter 

The  form  of  the  filter  equations  (4.3)  -  (4.7)  is  the  most  common  one  as  it  im¬ 
mediately  reflects  the  operation  of  the  estimation  procedure.  From  a  computational 


48 


Chapter  4:  Recursive  Estimation  Theory 


Update  Predict 


Figure  4.3:  A  block  diagram  of  the  simplified  linear  Kalman  filter. 

point  of  view,  however,  the  number  of  matrix  multiplications  and  inversions  is  very 
high  and  can  be  reduced  considerably  (see  [15]).  Instead  of  using  the  covariamce 
matrix  Pfc  we  introduce  its  inverse  S*  =  which  could  be  termed  the  ’’certainty 
matrix”.  The  filter  equations  now  become 
Update: 

St  =  Sfc+c[R;-*Cfc 

Kfc  =  (Str'ClR^^ 

Xfc  =  x;  +  KkiYk  -  CfcXfc  ) 

Prediction: 

=  Ajxf  (4.19) 

S;+,  =  a;''s:a;‘  (4.20) 

The  operation  of  the  simplified  linear  filter  is  depicted  in  the  block  diagram  of 
figure  4.3.  The  same  simplifications  apply  to  the  nonlinear  filter  (4.9)  -  (4.11)  vdthout 
modification  and  for  the  implicit  filter  (4.13)  •  (4.15)  by  replacing  Rjk  with  D^Rj^D^. 
In  each  case  the  number  of  matrix  operations  is  reduced  considerably. 

4.5  Filter  update  and  energy  functionals 

I  will  show  that  the  update  stage  of  the  filter  minimizes  a  simple  energy  function  on 
the  state  vector.  The  energy  function 

f;(x+)  =  i(x+  -  x-)^S-(x+  -  X-)  +  ^(y  -  Cx+)^R-‘(y  -  Cx+)  (4.21) 


(4.16) 

(4.17) 

(4.18) 


4.6:  Properties  of  the  Kalman  filter 
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has  a  mimmum  state  x***  that  is  ’’closest”  to  both  the  current  state  x~  and  the 
current  measurement  y  each  weighted  by  its  covariance  matrix  (It  can  be  derived  by 
marginalizing  the  posterior  Gaussian  distribution  of  x'*'  given  the  measurement  y). 
In  addition,  if  S~  is  non-diagonal  the  energy  function  will  enforce  a  correlation  of 
the  elements  of  the  vector  x***  among  each  other  (such  as  a  smoothness  constraint). 
Differentiation  with  respect  to  x'*'  yields 

=  {S-  +  C^R-"C)x+  -  S-x-  -  C^R-V  =  0  (4.22) 

ax"*" 

which  simplifies  to 

x+  =  X-  -f-  (S+)-*C^R-^(y  -  Cx")  (4.23) 

if  we  introduce  =  S“  -I-  C^R"^C.  This  is  just  the  modified  update  equation  for 
the  Kalman  filter  (4.17)  and  S'**  is  the  updated  covariance  matrix  (4.16). 

As  we  have  seen  in  chapter  2,  energy  functionals  are  used  in  instantaneous  surface 
reconstruction  to  formulate  constraints  on  the  desired  surface.  The  fact  that  the 
update  stage  of  the  Kalman  filter  minimizes  such  an  energy  functional,  will  help  us  to 
establish  the  essential  connection  between  the  Kalman  filter  update  and  conventional 
surface  reconstruction  techniques. 


4.6  Properties  of  the  Kalman  filter 

A  number  of  properties  of  the  Kalman  filter  are  noteworthy,  beyond  the  fact  that  it 
minimizes  an  energy  function. 

•  The  estimates  of  the  linear  Kalman  filter  can  be  shown  to  be  optimal  in  the 
sense  that  the  expected  deviation  between  the  estimate  and  the  true  value  is 
minimal  among  all  possible  estimators  (linear  or  nonlinear).  For  this  reason, 
the  linear  Kalman  filter  is  frequently  referred  to  as  the  minimum  variance 
estimator. 

•  By  considering  the  probability  density  functions  of  the  estimation  and  mea¬ 
surement  processes,  we  can  show  that  the  estimate  of  the  Kalman  filter  is  the 
one  of  maximum  likelihood  in  the  sense  that  the  conditionail  probability  density 
of  the  current  estimate  given  all  previous  measurement  values  has  a  maximum 
at  the  value  that  the  Kalman  filter  computes. 

•  It  can  be  shown  that  the  linear  Kalman  filter  estimate  converges  to  the  true 
value. 

In  plain  terms:  the  linear  Kalman  filter  will  determine  the  correct  value  and  do  so 
faster  than  any  other  estimator. 
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For  our  purposes,  it  is  useful  to  see  if  these  properties  of  the  Kalmsui  filter  persist 
if  some  of  the  rather  stringent  preconditions  are  dropped.  We  will  consider  three 
cases  that  are  relevant  to  the  work  in  this  thesis: 

•  If  the  noise  processes  on  state  or  measurement  are  not  Gaussian,  the  Kalman 
filter  is  no  longer  an  optimal  estimator,  however,  it  is  still  the  optimal  linear 
estimator.  In  other  words,  there  may  be  nonlinear  estimators  that  converge 
faster  than  the  Kalman  filter,  but  among  all  linear  estimators,  it  is  the  best. 

•  If  the  noise  processes  of  state  and  measurement  vector  are  not  uncorrelated, 
an  alternative  formulation  of  the  Kalman  filter  (see  [28])  exists  which  has  all 
the  properties  of  the  original  version.  Using  the  original  version  of  the  filter  on 
a  problem  with  correlated  noise  processes  (effectively  ignoring  the  correlation) 
results  in  a  loss  of  the  optimality  property  of  the  filter. 

•  The  nonlinear  and  implicit  versions  of  the  Kalman  filter  use  Taylor  series  ap¬ 
proximations  to  the  nonlinear  functions.  As  a  consequence,  they  need  not  be 
optimad  or  even  converge.  Gelb  [28]  writes  about  the  extended  Kalman  filter 
on  page  189:  “There  is  no  guarantee  that  the  actual  estimate  obtained  will  be 
close  to  the  truly  optimal  estimate.  Fortunately,  the  extended  Kalman  filter 
has  been  found  to  yield  accurate  estimates  in  a  number  of  important  practical 
applications.” 


Recursive  Estimation  and 
Temporal  Surface  Reconstruction 

Chapter  5 


In  this  chapter  we  will  investigate  how  temporal  surface  reconstruction  can  be 
formulated  as  a  recursive  estimation  problem.  At  the  heart  of  this  formulation  is 
the  task  of  determining  the  quantities  that  constitute  the  state  and  measurement 
variables  of  the  dynamical  system  as  well  as  its  state  and  measurement  equations. 
Other  sections  in  this  chapter  are  devoted  to  discussions  of  the  update  and  prediction 
stages  of  the  Kalman  filter  for  temporal  surface  reconstruction  and  the  initialization 
of  the  filter. 

5.1  Intuitive  concepts 

Intuitively,  the  temporal  surface  reconstruction  problem  maps  nicely  into  a  dynam¬ 
ical  system  so  that  the  desired  surface  structure  can  be  estimated  with  a  recursive 
estimation  procedure.  In  our  case  the  surface  or  structure  is  unknown  and  we  would 
like  to  estimate  it.  In  addition,  the  surface  may  change  over  time  due  to  the  possible 
relative  motion  between  observer  and  scene.  Likewise,  a  dynamical  system  has  an 
internal  state  that  changes  dynamically  over  time  and  is  the  subject  of  the  recursive 
estimation.  Hence,  we  can  think  of  the  three-dimensional  surface  structure  as  the 
state  of  a  dynamical  system. 

For  the  purpose  of  estimation,  we  have  available  to  us  the  sequences  of  images  or 
derived  quantities  such  as  the  optical  flow.  They  are  related  to  the  unknown  surface 
structure  via  the  “visual  mechanism”  such  as  stereo,  motion  or  shading.  Similarly, 
the  measurement  part  of  a  dynamical  system  relates  the  externally  available  mea¬ 
surement  vector  to  the  unknown  internal  state  vector.  This  suggests  that  we  could 
interpret  the  images /optical  flow  as  measurement  values  and  the  visual  mechanism 
as  the  measurement  equations  of  a  dynamical  system. 

These  choices  for  the  state  and  measurement  vectors  determine  a  dynamical  sys¬ 
tem  that  models  the  imaging  process  for  a  temporally  dynamic  surface.  Figure  5.1 
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Figure  5.1:  The  qualitative  dynamical  system  that  describes  the  imaging  process  for 
temporally  dynamic  surfaces. 

qualitatively  depicts  this  dynamical  system. 

The  system  model  (the  left  block)  describes  the  dynamic  change  in  the  state 
vector  which  is  the  depth  map  in  our  case.  Therefore  the  system  dynamics  corre¬ 
sponding  to  (4.1)  must  describe  the  change  in  depth  values  from  one  time  instant  k  to 
the  next.  As  we  are  assuming  rigidity  of  scene  objects,  a  temporal  change  in  depth 
values  can  only  be  due  to  a  relative  motion  between  the  camera  and  the  surface. 
Hence,  the  system  dynamics  are  simply  the  kinematic  transformation  (translation 
and  rotation)  of  points  in  space. 

The  measurement  model  (the  right  block)  describes  the  relationship  between  the 
state  (the  depth  in  our  case)  and  the  measurable  values  (the  image  values).  In 
accordance  with  (4.2)  the  measurement  model  will  therefore  encapsulate  the  visual 
mechanism  which  does  precisely  this:  it  relates  structure  and  image  values.  As  a 
consequence,  the  measurement  model  depends  on  the  specific  visual  mechanism  that 
is  being  used  for  the  surface  reconstruction.  In  the  case  of  stereo  for  example,  the 
measurement  equation  describes  the  relationship  between  stereo  matches  and  the 
distance  to  scene  points.  For  shape  from  shading,  the  measurement  encapsulates  the 
reflectance  function  that  relates  brightness  and  depth  gradients.  The  construction 
of  a  recursive  estimator  will  have  to  be  different  for  each  case. 


5.2  A  Recursive  Estimator  for  the  Temporal  Sur¬ 
face  Model 

Now  that  we  have  gained  some  insight  on  how  a  temporal  surface  can  be  modeled 
as  a  dynamical  system,  we  will  construct  a  recursive  estimator,  that  will  recover 
the  surface  when  given  the  measured  image  values.  This  section  gives  a  general 
description  of  the  filter  construction  and  operation.  A  detailed  discussion  of  each 
aspect  of  the  recursive  estimator  follows  in  the  subsequent  chapters. 


5.2:  A  Recursive  Estimator  for  the  Temporal  Surface  Model 
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As  we  recall  from  chapter  4  and  figure  4.2,  the  Kalman  filter  maintains  an  estimate 
5tk  of  the  state  of  a  dynamical  system  along  with  its  covariance  by  processing  the 
sequence  of  measurements  and  corresponding  covariances  Rj^.  The  processing 
proceeds  in  two  stages  for  each  iteration  k:  the  update  and  the  prediction  stage. 

5.2.1  The  Filter  State  and  Measurement 

Our  first  task  is  to  determine  the  state  and  measurement  vectors  2Llong  with  their 
associated  covariance  matrices  such  that  the  resulting  recursive  estimator  will  provide 
a  solution  to  the  temporal  surface  reconstruction  problem.  The  intuitive  model  of 
the  dynamical  system  outlined  above  provides  the  basis  for  our  choices. 

Following  the  intuitive  concepts  introduced  in  section  5.1,  the  state  vector  repre¬ 
sents  the  surface  structure  that  we  intend  to  recover.  A  structural  description  of  a 
visual  scene  is  given  by  a  depth  map  (Zij)  (see  section  2.1)  in  which  the  correspond¬ 
ing  depth  value  is  stored  for  each  pixel  (i,j)  in  the  image  plane.  To  construct  the 
state  vector  x*  of  the  dynamical  system  we  collect  all  depth  values  of  frame  k  into  a 
column  vector  which  is  accomplished  by  adjoining  the  rows  of  the  depth  map 


The  construction  of  the  state  vector  from  the  depth  map  can  be  formalized  as 

®im+j  —  Zij  t  =  0,  ...,Ti  1,_7  =0,...,  m  1.  (5.2) 

Consequently,  the  covariance  matrix  P  of  the  state  vector  is  an  x  AT  matrix 
that  contains  the  variances  of  the  depth  estimates  in  the  diagonal  and  the  covariances 
between  depth  values  in  the  remaining  entries.  Note  the  extraordinary  size  of  this 
matrix,  as  iV  is  the  number  of  pixels  in  the  image  value  array!  A  matrix  of  this  size 
is  beyond  computational  manageability.  However,  by  using  the  inverse  S  =  P~^  in 
the  simplified  version  of  the  Kalman  filter  (see  section  4.4)  and  by  taking  advantage 
of  the  local  nature  of  surface  correlation,  the  filter  operates  with  a  sparse,  banded 
representation  of  the  inverse  covariance  matrix  that  requires  significantly  less  storage. 
Details  are  described  in  section  6.3. 

The  measurement  vector  y  represents  the  image  values  that  are  available  to  us 
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through  the  visual  mechanism.^  The  image  brightness  values  are  given  in  an  array 
(Eij)  and  can  therefore  be  mapped  into  the  measurement  vector  y  in  exactly  the 
same  way  as  the  depth  map  was  mapped  into  the  state  vector  above.  We  collect  all 
the  elements  of  the  image  frame  (Eij)  into  a  column  vector  yi,  so  that 


(Eij)  = 

Eo,o 

•  •  •  Eo,m-l 

yo 

yi 

^n~l,0 

'  •  •  -E'n-l.m-1 

— ►y  = 

.  yN-i . 

■^'0,0 
^0,1 

Elfi 

L  J 


(5.3) 


which  can  be  written  explicitly  as 

yim+j  =  Eij  i  =  0,...,n  -  l;j  =  0,...,m- 1.  (5.4) 

The  covariance  matrix  R  of  the  measurement  vector  represents  the  uncertainty 
in  the  image  values.  In  the  case  of  brightness  measurements  Eij  for  example,  this 
uncertainty  is  due  to  measurement  noise  the  distribution  of  which  can  be  assumed  to 
be  Gaussian  iV(0,cr)  as  required  by  the  Kalman  filter.  If  pixel  noise  is  uncorrelated 
and  identically  distributed  the  measurement  covariance  is  R  =  <r*I  where  I  is  the 
identity  matrix.  Note  that  R  is  once  again  a  matrix  of  very  large  dimensions  (here 
N  X  N)  but  remains  computationally  manageable  due  to  its  special  structure. 


5.2.2  The  Filter  Update  Stage 

The  next  task  is  to  establish  the  operation  of  the  update  stage  of  the  recursive 
estimator.  As  shown  in  figure  5.2,  this  part  of  the  filter  combines  the  latest  mea¬ 
surement  Yfc  with  the  current  estimate  x^  to  compute  the  updated  estimate  x^.  In 
terms  of  the  surface  reconstruction  problem:  the  update  stage  will  combine  the  newly 
obtained  image  values  with  the  current  estimate  of  surface  structure  to  produce  an 
improved  structure  estimate. 

As  a  consequence,  this  update  process  depends  on  the  specific  image  values  that 
are  used  as  measurements  and  the  specific  visual  mechanism  that  is  used  to  link  them 
to  the  surface  structure  values.  For  each  visual  mechanism  such  as  stereo,  shading 
or  motion,  the  measurement  equation 

y  =  Cx  or  y  =  f(x)  or  g(x,y)  (5.5) 

‘  As  we  have  seen  in  chapter  2  these  values  need  not  be  exclusively  image  brightness  but  could 
also  be  optical  flow  or  stereo  matches.  For  the  purpose  of  introducing  the  concepts,  we  will  work 
with  image  brightness  in  this  section  and  describe  alternatives  in  the  following  chapters. 
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Figure  5.2;  The  update  stage  of  the  Kalman  filter  for  temporal  surface  reconstruction. 

will  be  different  and  so  will  the  corresponding  update  stage.  Chapter  8  describes  the 
update  stage  for  temporal  surface  reconstruction  &om  optical  flow  measurements  in 
detail.  Chapter  9  does  the  same  for  the  shading  mechanism  and  chapter  10  focuses 
on  the  direct  motion  mechanism. 

There  are,  however,  a  number  of  general  statements  that  can  be  made  about  the 
update  stage  without  restriction  to  a  particular  visual  mechanism  that  highlight  the 
similarity  between  the  update  procedure  and  the  instantaneous  surface  reconstruc¬ 
tion  techniques  from  chapter  2. 

The  update  stage  of  the  simplified  recursive  estimator  (4.16),  (4.17),  (4.18)  in¬ 
volves  the  matrix  inversion  of  the  certainty  matrix  : 

it  =  K  +  (s:  )'‘cJ'R,-‘(y*  -  c„xi-)  (5.6) 

We  abbreviate  the  residual  p  =  ic*  —  and  q  =  C2'Rj"*(yfc  —  Cs£a)  and  the 
problem  simplifies  to 

P  =  (S*)"^q  (5-7) 

where  SjjJ'  is  a  sparse,  banded  matrix.  ^  Due  to  this  sparse  nature  of  the  matrix 
we  can  solve  (5.7)  by  an  iterative  relaxation  method  such  as  Jacobi  or  Gauss- 
Seidel  (see  Golub  and  Van  Loan  [30]).  Gauss-Seidel  is  the  preferred  method  for  serial 
implementations  and  is  described  by  the  following  iteration  equation 

-  E  (5-8) 

where  L{i)  is  the  set  of  indices  j  with  non-zero  entries  in  the  ith  row  of  S.  The 
iterative  process  is  initialized  with  p  =  0  (this  requires  only  a  small  number  of 
iterations  in  the  steady  state  of  the  filter  estimation)  and  the  state  vector  can  be 

is  only  sparse  and  banded  in  general  if  we  make  the  assumption  of  a  viewpoint  independent 
surface  model  as  explained  in  section  6.3. 


56 


Chapter  5:  Recursive  Estimation  and  Temporal  Surface  Reconstruction 


obtained  easily  from  +  p.  Algorithmically,  the  update  stage  (5.8)  of  the 

Kalman  filter  is  therefore  exactly  the  same  iterative  Gauss-Seidel  procedure  used  for 
instantaneous  surface  reconstruction  such  2ls  (2.12),  (2.18)  and  (2.21). 

From  section  4.5  we  recall  that  the  update  stage  of  a  Kalman  filter  determines 
the  new  estimate  such  that  it  minimizes  an  energy  function 

F(x+)  =  i(x+  -  x-)^S-(x+  -  X-)  +  \{y  -  Cx+)^R-^(y  -  Cx+)  (5.9) 
which  enforces 

•  Compatibility  of  the  new  estimate  xj^  with  the  current  measurement 

•  Closeness  of  the  new  estimate  xjjl’  to  the  previous  estimate  xj^ . 

•  ’’Smoothness”  or  similar  correlation  of  the  elements  of  x^  for  appropriate  values 
of  S'. 

If  we  now  remember  that  each  one  of  the  instantaneous  surface  reconstruction  pro¬ 
cedures  in  chapter  2  employed  an  energy  function  that  enforced  compatibility  of  the 
reconstructed  surface  with  the  measured  image  values  as  well  as  surface  smoothness 
we  come  to  the  following  realization:  The  update  stage  of  the  recursive  estimator 
for  a  given  visual  mechanism  is  identical  to  the  corresponding  instantaneous  surface 
reconstruction  procedure  with  two  important  additional  features: 

•  The  recursive  estimator  additionally  enforces  ’’closeness”  to  the  estimate  from 
the  last  iteration  and  thereby  carries  over  the  information  from  previous  esti¬ 
mates. 

•  The  energy  terms  are  weighted  with  the  covariance  matrices  so  as  to  explicitly 
take  the  uncertainty  into  account. 

In  other  words:  the  Kalman  filter  performs  a  (stochastic  weighted)  surface  recon¬ 
struction  at  each  time  k  and  thereby  accomplishes  the  primary  goal  of  extending 
surface  reconstruction  into  the  temporal  domain. 

5.2.3  The  Filter  Prediction  Stage 

Here  we  discuss  the  basic  operation  of  the  prediction  stage  of  the  recursive  es¬ 
timator.  As  shown  in  figure  5.3,  this  part  of  the  filter  takes  the  updated  estimate 
Xfc  and  its  covariance/certmnty  SjJ’  at  time  k  and  predicts  what  these  values  will  be 
at  time  ^  -I-  1.  For  the  purpose  of  temporal  surface  reconstruction,  this  means  that 
the  prediction  stage  must  transform  the  estimate  of  surface  structure  (depth  map) 
corresponding  to  the  image  values  in  frame  k  to  the  depth  map  corresponding  to 
frame  k  +  1. 
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Figure  5.3:  The  prediction  stage  of  the  Kalman  filter  for  temporal  surface  recon¬ 
struction. 

Under  the  rigid  body  assumption,  the  only  way  in  which  a  change  in  surface 
structure  can  occur  between  frames  is  by  a  relative  motion  of  the  surface  with  respect 
to  the  camera.  At  first  glance,  this  appears  to  be  a  very  simple  task.  Let  the 
motion  between  time  k  and  A;  -I-  1  be  represented  by  a  translation  t  and  a  rotation 
n.  The  depth  value  at  location  {xj,yi)  in  the  depth  map  corresponds  to  a  point 
P  =  [xZ//,t/Z//,  Z]  on  the  surface.  This  point  will  move  according  to  the  kinematic 
equation 

Pfc+i  =  -t  -  nPk.  (5.10) 

The  predicted  surface  can  therefore  be  obtained  by  applying  this  transformation  to 
all  the  points  P*  corresponding  to  entries  (i,j)  in  the  current  depth  map/state  vector 
and  filling  the  predicted  depth  map  with  the  Z-coordinates  of  the  transformed  points 
Pk+l- 

This  approach  is  missing  one  important  point:  The  depth  map  entry  Z.-j  must 
be  the  distance  along  the  optical  axis  at  which  a  ray  through  pixel  strikes 

the  surface.  When  a  depth  map  entry  Zij  is  converted  to  a  point  P  and  transformed 
according  to  (5.10)  not  only  does  the  distance  Z  change  but  also  the  X  and  Y  coordi¬ 
nates  of  P  may  change  and  hence  the  image  plane  location  (xj,yi).  The  transformed 
Z  may  have  to  be  assigned  to  a  new  depth  map  location  In  essence,  it  is 

necessary  to  resample  the  warped  surface  at  the  depth  map  grid  point  locations. 

In  addition  the  prediction  stage  must  accomplish  the  transformation  of  the  covari¬ 
ance/certainty  matrix  S  in  such  a  way  that  the  resampling  of  the  warped  surface  is 
taken  into  account.  Fortunately,  the  prediction  of  the  depth  map  and  its  covariance 
are  independent  of  the  particular  visual  mechanism  that  is  used  to  recover  structure 
from  image  values.  Chapter  7  describes  the  algorithm  for  prediction  of  the  depth 
map  state  vector  and  the  associated  covariance  in  detail. 

5.2.4  Filter  initialization 

Since  the  filter  process  is  recursive  in  nature  it  must  be  initialized  at  some  point. 
We  must  provide  initial  values  for  the  state  Xq  and  the  associated  covariance  matrix 
Po.  The  standard  procedure  in  recursive  estimation  is  to  initialize  the  entries  of  Xo 
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to  zero  and  the  entries  of  Po  to  oo.  This  reflects  the  fact  that  the  uncertainty  in  the 
initial  value  is  very  high. 

In  our  particular  case,  however,  some  additional  information  is  available  and  can 
be  used  to  initialize  the  state  estimate.  As  the  detailed  derivation  in  chapter  6 
will  show,  a  proper  initialization  of  the  inverse  covariance  matrix  S  can  be  used  to 
incorporate  prior  models  of  the  surface  structure  such  as  the  smoothness  constraints 
that  we  are  already  familiar  with. 


5.3  To  Do 

This  section  enumerates  the  components  that  constitute  a  temporal  surface  recon¬ 
struction  algorithm.  The  purpose  is  two-fold:  First,  it  will  give  the  reader  an  overview 
of  what  to  expect  in  the  following  detailed  description  of  the  parts  of  the  temporal 
method.  Second,  since  visual  mechanisms  other  than  the  ones  discussed  in  this  thesis 
can  be  embedded  into  the  temporal  reconstruction  framework,  this  section  provides 
a  recipe  for  the  embedding. 

1.  Pick  a  state  vector  x  and  an  meausurement  vector  y.  For  the  purpose  of  surface 
reconstruction  the  state  vector  will  be  the  concatenation  of  all  depth  values 

or  related  values  such  as  the  disparity  as  shown  in  (5.1).  The  measurement 
’  vector  must  contain  values  that  are  directly  obtainable  from  the  imaging  pro¬ 
cess. 

2.  Determine  the  relationship  between  the  state  x  and  measurement  y  from  the 
particular  visual  mechanism.  Find  the  matrix  C  from  either 

y  =  Cx  if  the  relationship  is  linear 

y  =  f(x)  C  =  ^  if  the  relationship  is  non-linear 

g(x,y)  =  0  C  =  5“  if  the  relationship  is  implicit  non-linear 

In  the  implicit  non-linear  case,  the  matrix  D  =  ^  must  also  be  determined. 
The  linear  case  is  the  preferred  one  and  can  often  be  attained  through  suitable 
choice  of  the  measurement  vector. 

3.  Determine  the  covariance  R  of  the  measurement  vector.  If  the  measurement 
vector  consists  of  the  brightness  values,  they  can  be  modeled  as  independent 
and  identically  distributed,  and  R  =  (t^I.  If  the  measurement  vector  is  de¬ 
rived  from  the  brightness  values,  the  variances  must  be  propagated  through 
the  relationship  used  in  the  derivation  to  obtain  R. 
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This  three-step  process  determines  all  of  the  quantities  necessary  for  the  update 
stage  of  the  filtering  algorithm  (4.16)  -  (4.18).  Examples  of  this  process  for  three 
different  visual  mechanisms  are  given  in  chapters  8,  9  and  10. 

Two  other  steps  are  necessary  before  a  temporal  surface  reconstruction  algorithm 
can  be  implemented:  the  prediction  and  initialization.  Both,  however,  are  indepen¬ 
dent  of  the  visual  mechanism.  They  are  described  in  chapters  7  and  6  and  can  be 
implemented  directly  regardless  of  the  particular  application. 


Filter  Initialization  and  Prior 
Surface  Models 


Chapter  6 


This  chapter  describes  how  initial  values  for  the  state  Xq  and  the  certainty  matrix 
So  can  be  chosen.  Moreover,  we  show  that  a  proper  choice  of  Sq  will  enforce  prior 
models  that  we  may  have  of  the  surface  structure  such  as  smoothness. 

6.1  Surface  Models 

Surface  models  represent  information  about  the  structure  of  a  surface  that  we  may 
have  even  before  we  begin  any  surface  reconstruction  from  images.  Smoothness  is 
the  most  common  example  of  a  prior  model  used  in  computational  vision.  Surface 
models  have  evolved  from  simple  deterministic  models  to  complex  stochastic  models. 
The  work  of  Szeliski  [93]  provides  an  exceUent  overview  and  serves  as  the  basis  for 
the  presentation  in  this  chapter. 

A  common  representation  of  prior  models  is  in  terms  of  energy  functions  on 
the  depth  map  entries.  They  impose  a  certain  constraint  on  the  surface  when  it  is 
computed  as  the  minimum  of  the  energy  function.  An  example  that  we  have  already 
encountered  in  chapter  2  on  instantaneous  surface  reconstruction  is  the  membrane 
model 

£(Z)  =  5  +  (Zm..  -  (6-1) 

Introduced  by  Terzopoulos  [96]  and  Crimson  [31],  this  model  is  used  as  a  “stabilizer” 
in  the  regularized  solution  of  ill-posed  early  vision  problems  (Poggio  et  al.  [80]).  The 
effect  of  the  energy  function  is  easily  identified:  it  penalizes  for  differences  between 
horizontally  and  vertically  neighboring  depth  map  entries  and  thereby  forces  the 
surface  to  be  smooth.  If  used  alone,  any  constant  surface  (Zij)  will  minimize  the 
energy  function  (6.1).  Alternative  surface  models  such  as  the  thin  plate  model  have 
also  been  studied  extensively  in  the  work  cited  above  and  can  be  used  in  the  same 
way  as  the  membrane  model  in  the  following  derivation. 
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Figure  6.1:  The  interaction  between  depth  map  entries  in  the  membrane  surface 
model. 

If  we  collect  all  the  depth  map  entries  into  a  one-dimensional  vector 

i  =  0,...,n  -  l;j  =  0,...,m  -  1  (6.2) 

the  membrane  energy  model  can  be  written  as  a  quadratic  form 

£(x)  =  ^x^Sx  (6.3) 

where  S  is  a  sparse  banded  matrix  with  5  entries  per  row: 

{—1  I  =  k  —  l,k -i- l,k  —  Tn,k  +  m 

4  l  =  k  (6.4) 

0  otherwise 

The  matrix  S  establishes  an  interaction  between  a  depth  map  entry  Zij  and  its  four- 
connected  neighbors  as  shown  in  figure  6.1.  Note  that  the  construction  of  the  vector 
X  is  identical  to  the  one  used  for  the  state  vector  of  the  recursive  estimator  (5.1). 


6.2  Probabilistic  Surface  Models 

Surface  models  represented  by  energy  functions  can  also  be  modeled  stochastically. 
This  is  a  prerequisite  for  the  application  of  probabilistic  estimation  techniques  such 
as  Bayesian  estimators  that  represent  uncertainty.  The  idea  is  to  model  the  surface 
by  giving  a  probability  distribution  for  the  depth  field.  Geman  and  Geman  [29] 
showed  how  to  arrive  at  a  probability  distribution  for  the  field  values  given  the 
energy  function  constraining  its  shape.  It  is  the  Gibb^s  distribution 

exp(-£(x)/r) 

Exexp(-£(x)/r) 


p(x)  = 


(6.5) 
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where  T  is  the  ’’temperature”  of  the  field.  If  E{x)  involves  only  interactions  among 
neighboring  field  values  as  for  the  membrane  energy  above  the  probability  distribu¬ 
tion  describes  a  Markov  Random  Field,  a  widely  studied  stochastic  model  for  surface 
structure  (Marroquin  [63],  Poggio  et  al.  [79]). 

We  observe  two  key  features  of  the  stochastic  model  for  membrane- type  surfaces: 
The  distribution  (6.5)  is  a  multivariate  Gaussian  distribution  and  the  covariance  of 
the  distribution  is  P  =  (for  suitable  choice  of  T).  The  certainty  matrix  S  plays 
the  central  role  in  this  formulation  as  it  encapsulates  the  particular  prior  model  that 
is  being  imposed  on  the  surface. 


6.3  Filter  Initialization 

The  Kaiman  filter  estimation  process  (4.16)  -  (4.20)  must  be  initialized  by  choosing 
initial  values  Xq,  Sq  for  the  state  vector  and  the  associated  certainty  matrix.  These 
choices  must  be  made  such  that 

x~iV(xo,So^)  (6.6) 

at  the  outset,  i.e.  such  that  they  determine  the  prior  stochastic  model  of  surface 
structure.  As  we  have  seen  above,  the  choice  of  the  certainty  (inverse  covariance)  S 
determines  the  surface  model.  If,  for  example,  we  set  So  as  in  (6.4),  we  will  impose 
membrane  smoothness  as  the  prior  model  on  the  surface.  This  is  done  throughout 
the  experiments  described  in  this  thesis.  Other  prior  models  such  as  the  thin  plate 
can  be  imposed  in  the  same  way  described  above. 

As  mentioned  before,  any  surface  of  constant  depth 

^im+j  —  ^  —  0,*..,n  J  ”  0,  .  .  .  ,  771  1  (6.7) 

minimizes  the  constraint  energy  for  the  membrane  model  (6.1).  In  the  experiments 
in  this  thesis,  we  chose  Zo  to  be  an  average  value  of  depth  in  the  scene  as  it  may  be 
obtained  from  a  crude  depth  sensor  such  as  an  ultrasonic  device. 

The  sparse  and  banded  structure  of  the  matrix  S  is  of  particular  importance 
for  the  calculation  of  the  Kalman  filter  update  (4.16)  -  (4.18)  which  requires  the 
inversion  of  S.  In  general,  the  inversion  is  a  formidable  task  considering  that  S  is  of 
size  71771  X  71771.  If  S  is  sparse  and  banded,  the  inversion  can  be  accomplished  through 
an  iterative  process  such  as  (5.8)  that  costs  only  0{nm). 

Although  the  sparse  banded  nature  of  Sfc  is  not  preserved  by  the  filter  predic¬ 
tion  equations,  we  can  make  the  simplifying  assumption  that  prior  models  such  as 
surface  smoothness  are  independent  of  the  camera  viewpoint  and  should  therefore 
be  unaffected  by  the  Kalman  filter  equations.  This  allows  us  to  separate  the  inverse 
covariance  matrix  S*  at  time  k  into  a  diagonal  matrix  S*  and  the  prior  matrix  S: 

Sfc  =  Sfc  S.  (6.8) 
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Note  that  we  can  safely  replace  with  S*  in  the  filter  update  equation  (4.16) 
by  subtracting  S  on  both  sides  while  we  ignore  the  transformation  of  off-diagonal 
elements  in  the  (4.20).  This  is  an  important  contribution  to  the  computational 
feasibility  of  the  temporal  surface  reconstruction  scheme. 


Filter  Prediction 

Chapter  7 


This  chapter  describes  the  prediction  stage  of  the  recursive  estimator  for  temporal 
surface  reconstruction.  The  algorithm  accomplishes  the  warping  of  a  depth  map 
corresponding  to  a  rigid  body  motion.  In  addition,  the  prediction  of  the  covariance 
matrix  corresponding  to  the  depth  map  state  vector  is  described.  As  an  alternative, 
a  computationally  less  expensive  approximative  prediction  algorithm  concludes  the 
chapter. 

7.1  Prediction  of  the  Depth  Map 

The  prediction  stage  (4.6),  (4.7)  of  the  filter  process  must  account  for  changes  in  the 
state  vector  that  occur  between  sequential  measurements.  For  the  depth  estimation 
process  this  is  equivalent  to  a  change  in  depth  from  one  frame  to  the  next  which  can 
only  be  due  to  a  relative  motion  between  scene  and  imaging  system.  The  prediction 
process  must  therefore  transform  all  depth  map  entries  according  to  this  relative 
motion.  This  is  a  purely  geometric  transformation. 

We  are  given  a  depth  map  (Zij)  at  time  k,  the  relative  motion  between  camera 
and  scene  from  />;  to  /;  +  1  as  a  translation  vector  t  and  a  rotation  matrix  /7  and  the 
projection  geometry  (focal  length  /,  pixel  spacing  Ax,  Ay  and  image  size  w  x  h). 
This  situation  is  shown  in  figure  7.1.  The  objective  is  to  determine  the  depth  map 
(Zlj)  at  time  A:  +  1. 

The  idea  of  the  algorithm  is  as  follows:  each  entry  in  the  depth  map  corre¬ 
sponds  to  a  point  Pij  in  space  via  perspective  projection.  The  transformation  of  Pij 
from  time  A;  to  A;  -I-  1  can  then  be  accomplished  by  applying  the  rotation  matrix  and 
the  translation  vector.  Since  the  transformed  points  P[j  will  not  necessarily  project 
back  onto  depth  map  grid  points  at  time  -f  1  it  becomes  necessary  to  interpolate 
and  resample  the  surface. 
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Figure  7.1:  A  surface  corresponding  to  a  depth  map  and  the  effect  of  a  motion 
transformation. 

I  will  now  describe  the  steps  of  the  algorithm  in  more  detail: 

1.  Inverse  Projection: 

Each  depth  map  entry  Z^j  corresponds  to  a  depth  value  at  physical  location 

Xj  =  (j  -  (id  -  l)/2)Ai  and  =  (i  -  (/i  -  l)/2)Ay  (7.1) 

in  the  image  plane.  This  represents  a  point  Pij  =  [X,Y,Z]'^  in  space,  the  co¬ 
ordinates  of  which  are  given  by  inverting  the  perspective  projection  equations: 

X.i  =  ^  and  Y„  =  ^  (7.2) 

2.  Warping: 

Now  we  can  account  for  the  relative  motion  between  frames  by  applying  the 
rotation  and  translation  transformations  to  each  point  P^r 

=  -t  -  np,  (7.3) 

3.  Resampling: 

The  straightforward  approach  would  be  to  project  the  points  Plj  back  into  the 
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image  plane  to  obtain  the  new  depth  map.  However,  this  will  not  maintain  the 
representation  of  depth  on  a  grid  corresponding  to  the  image  pixels  since  the 
warped  points  do  not  necessarily  project  onto  grid  point  locations  (shown  in 
figure  7.1  on  the  right).  It  is  necessary  to  resample  the  warped  depth  map  at 
grid  point  locations. 


Figure  7.2:  Triangular  facet  subdivision. 

To  accomplish  this  we  group  the  pixel  locations  (i,j)  into  triplets  by  dividing 
each  grid  square  into  two  triangles  as  shown  in  figure  7.2.  We  approximate  the 
surface  between  the  3D  points  Pij  corresponding  to  a  triplet  by  a  plane.  Al¬ 
though  other  surface  approximations  are  possible  (bilinear,  bicubic  etc.)  they 
make  the  resampling  more  difficult.  After  warping  (7.3)  each  triplet  still  de¬ 
fines  a  plane  so  that  the  warped  surface  can  be  approximated  by  its  planar 
triangulation. 

To  resample  the  triangulated  surface  at  grid  point  (xj,t/j)  we  must  determine 
the  intersection  of  the  ray  through  the  grid  point  with  the  closest  spatial  trian¬ 
gle  in  the  warped  surface  approximation.  This  is  accomplished  by  initializing 
the  new  depth  map  with  oo  everywhere  and  repeating  the  following  steps  for 
each  warped  triplet  of  points  PI-: 

9  Project  all  three  corner  points  P'  —  [A'',  K*,  back  into  the  image  plane 
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using 

x'  =  and  y'  =  (7.4) 

•  Determine  all  the  grid  locations  (t)j)  that  lie  within  the  backprojected 
triplet  of  points  (x',y'). 

•  For  each  grid  location  (i,j)  identified  in  the  previous  step  find  the  inter¬ 
section  of  the  ray  through  that  grid  point  with  the  spatial  triangle  under 
consideration.  If  the  Z  value  of  the  intersection  point  is  less  than  the  one 
stored  in  the  depth  map  at  (i,j)  then  replace  it  by  Z. 

We  notice  that  the  ray  through  a  given  grid  point  can  have  multiple  intersec¬ 
tions  with  the  warped  surface  which  means  that  part  of  the  surface  has  been 
occluded  by  another  region.  In  this  case  the  algorithm  chooses  the  physically 
correct  value:  the  one  closest  to  the  camera. 

4.  Unassigned  grid  points: 

The  resampling  procedure  will  not  necessarily  assign  a  depth  value  to  each 
grid  point  that  physically  corresponds  to  new  areas  in  the  scene  that  became 
visible  due  to  the  camera  motion.  In  general,  we  can  say  nothing  about  the 
correct  depth  value  at  these  locations.  If,  however,  we  assume  that  the  surface 
is  somewhat  smooth  then  we  can  extrapolate  the  depth  values  at  these  points 
from  known  values  at  neighboring  locations. 

So  far  I  have  described  how  the  prediction  of  depth  corresponding  to  (4.6)  can  be 
accomplished.  Due  to  the  discrete  grid  representation  of  the  depth  map  the  piecewise 
linear  surface  approximation  was  introduced  and  hence  the  prediction  algorithm  only 
approximates  the  Kalman  filter  state  prediction  (4.6). 


7.2  Prediction  of  the  Depth  Covariance 

The  simplified  Kalman  filter  represents  uncerteiinty  in  the  depth  map  state  vector 
by  the  inverse  S  =  of  the  covariance  matrix  (certainty  matrix).  The  prediction 
stage  of  the  recursive  estimator  must  therefore  determine  the  certainty  after  a 
possible  motion  transformation  when  given  the  certainty  S^. 

As  we  have  seen  in  chapter  6,  the  matrix  can  be  decomposed 

Sfc  =  Sfc  +  S.  (7.5) 

where  is  a  diagonal  matrix  containing  the  inverse  variances  of  the  depth  values 
in  the  state  vector  and  S  is  the  sparse  banded  matrix  that  represents  the  prior 
model  that  we  may  have  imposed  on  the  surface.  Our  prediction  algorithm  will  only 
transform  the  inverse  depth  variances  while  leaving  the  prior  model  unaffected. 
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It  is  possible  to  propagate  the  variance  values  through  the  prediction  equations 
used  for  the  depth  values  above.  This  requires  a  Taylor  series  approximation  of 
the  nonlinear  projection  equations  and  the  assumption  of  stochastically  independent 
depth  map  entries  (which  is  not  the  case  if  a  non-trivial  prior  model  is  imposed). 
The  description  of  the  algorithm  is  rather  lengthy  and  has  therefore  been  included 
as  appendix  B. 

A  more  practical  approximation  was  proposed  by  Szeliski  [93]  and  seems  to  work 
just  as  well:  Due  to  the  piecevdse  planar  surface  approximation  in  the  prediction  of 
the  depth  values,  a  small  error  is  introduced.  The  variance  <r|  of  the  depth  value 
Z  reflects  the  uncertainty  in  the  prediction  and  should  therefore  be  inflated  by  a 
small  multiplicative  factor  c  to  reflect  the  higher  uncertainty.  For  the  update  of  the 
diagonad  inverse  covariance  matrix  this  means 

.(k+i)  _ 

The  depth  map  prediction  algorithm  described  previously  may  leave  some  loca¬ 
tions  on  the  image  grid  “unassigned”  corresponding  to  regions  in  the  image  that 
appear  in  frame  A:  -f  1  and  were  not  visible  in  frame  k.  Although  the  algorithm 
eventually  fills  in  these  locations  with  neighboring  values,  the  fiUed-in  depth  infor¬ 
mation  has  very  high  uncertainty  and  the  diagonal  values  in  the  predicted  certainty 
matrix  8^4.1  must  reflect  this.  To  accomplish  this,  the  diagonal  entries  in  that 
correspond  to  formerly  “unassigned”  grid  locations  in  the  depth  prediction  are  set 
to  the  same  value  that  S  was  initialized  with  (see  chapter  6). 


7.3  An  Efficient  Approximative  Prediction  Algo¬ 
rithm 

The  prediction  algorithm  for  depth  and  certainty  described  above  is  the  computa¬ 
tionally  most  expensive  part  of  the  temporal  surface  reconstruction  procedure  (see 
section  11.1).  This  is  mainly  due  to  the  resampling  step.  This  section  describes 
an  approximative  resampbng  algorithm  that  is  considerably  less  complex  and  yields 
good  results  in  practice. 

As  before,  we  are  given  a  depth  map  (Zij)  on  a  rectangular  grid.  Suppose  that 
the  depth  map  has  been  warped  according  to  the  interframe  motion  as  described  in 
steps  1  and  2  of  the  prediction  algorithm  in  section  7.1.  The  new  depth  value  Zij  as 
well  as  its  new  location  (ij,yi)  have  been  calculated  but  xj  and  may  not  coincide 
with  the  coordinates  of  any  grid  point. 

For  the  computation  of  the  new  depth  map  value  Z(x,y)  at  a  grid  location  (i,y) 
we  will  consider  all  of  the  warped  depth  values  Z  that  are  closer  to  (x,y)  than  to 


Figure  7.3:  Distance  weighted  resampling  of  depth  values. 

any  other  grid  location.  Suppose  that  there  are  n  such  depth  values  and  that  they 
are  subscripted  k  =  1,. . .  ,n.  This  is  shown  in  figure  7.3  for  n  =  4. 

The  new  depth  map  value  Z{x,y)  is  computed  as  the  weighted  sum  of  the 

Z{xk,yk) 

n 

^(3:,y)  =  J^v}kZ(xk,yk)-  (7.7) 

k=l 

A  good  weighting  function  should  fulfill  the  following  requirements: 

•  0  <  iwfc  <  1 

A=i 

•  The  tufc  should  decrease  as  the  distance  d*  =  (x^  —  x)*  +  (t/fc  —  y)*  decreases. 
We  therefore  choose  the  weighting  factors  to  be 


(7.8) 
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This  clearly  fulfills  all  of  the  above  requirements  but  some  special  cases  must  be 
considered.  Suppose  all  n  estimates  involved  in  the  interpolation  have  equal  distance 
dk  =  d  from  the  grid-point.  In  this  case 


Wk  = 


(7.9) 


which  means  that  all  estimates  are  weighted  equally  as  one  would  expect.  A  more 
tricky  case  occurs  when  some  number  I  of  the  estimates  are  actually  located  at  the 
grid-point,  i.e.  di,  =  0  for  k  =  1,...,/.  For  estimates  on  the  grid-point  we  can 
rearrange  the  expression  for  the  weights  (7.8)  to  obtain 


Wk 


(7.10) 


as  d),  — »  0  for  fc  =  Similarly  we  rearrange  the  expression  for  the  weights  in 

the  case  of  an  estimate  that  does  not  coincide  with  the  grid-point 


Wk  = 


i 


1  + 


4+  El 


(7.11) 


k=l+l 


as  d/,  — »  0  for  A:  =  1, . . . ,  In  other  words  we  ignore  all  estimates  that  are  not  on  the 
grid-point  and  obtain  the  interpolated  value  as  the  mean  of  the  estimates  located  at 
the  grid-point. 

The  idea  of  distance-weighted  resampling  can  be  implemented  in  an  algorithm 
that  runs  in  time  proportional  to  the  number  of  entries  in  the  depth  map  [Zij)  by 
maintaining  a  weight  array  (wtj)- 

1.  Initialize  the  predicted  depth  map  and  the  weight  array  (tu<j)  to  zero. 

2.  For  each  i  and  each  j  do  the  following: 

(a)  Calculate  the  point  P  =  \X,Y,Z]  corresponding  to  Apply  the  rigid 
body  motion  transform  to  P  and  project  it  back  into  the  image  plane  (see 
section  7.1).  The  result  is  Z  at  location  {x,y). 

(b)  Calculate  the  (possibly  non-integral)  grid  coordinates  (i,  j)  corresponding 
to  (x,  y)  using  equation  (2.4)  and  the  closest  integral  grid  point  coordinates 
(Tn,n)  =  {r{i),r{j))  where  r()  is  the  rounding  operator.  The  physical 
grid  point  coordinates  (p,  q)  closest  to  the  warped  depth  map  value  are 
obtained  from  (m,n)  via  equation  (2.3). 


72 


Chapter  7;  Filter  Prediction 


(c)  Calculate  the  distance  between  depth  map  value  and  closest  grid  point 

d  =  (x-p)’  +  (y-9)’. 

If  d  is  not  zero  then  add  1/d  to  Wmn  and  add  Z/d  to  Zj!^. 

If  d  is  zero  and  is  less  than  zero,  then  decrement  w^n  subtract 

Z  from 

If  d  is  zero  and  Wmn  is  greater  or  equal  zero,  then  set  =  —  1  set 

ztv  =  -Z- 

3.  For  each  m  and  each  n  do  the  following: 

If  Wmn  is  positive,  set  =  Z^i^'/w^n- 

Upon  completion,  the  depth  map  will  contain  the  predicted  depth  values 

where  is  non-zero.  If  Wmn  is  zero,  the  location  (m,n)  is  “unassigned”  and  a 
filling  algorithm  such  as  the  one  outlined  in  section  7.1  can  be  applied.  U  we  apply 
the  same  calculations  to  the  diagonal  entries  of  the  certainty  matrix  S^,  the  algorithm 
also  accomplishes  the  prediction  of  the  certainty  matrix. 


Filter  Update:  Depth  from 
Motion  Using  Optical  Flow 

Chapter  8 


In  this  chapter  we  will  explore  the  application  of  the  temporal  surface  recon¬ 
struction  algorithm  to  the  problem  of  estimating  depth  from  motion  using  optical 
flow.  A  corresponding  instantaneous  algorithm  that  accomplishes  this  task  when 
given  a  single  measurement  of  the  optical  flow  was  described  in  section  2.2.7.  Only 
the  update  stage  of  the  recursive  estimator  is  dependent  on  the  visual  mechanism 
being  employed.  Therefore,  the  focus  in  this  chapter  is  exclusively  on  that  part  of 
the  algorithm. 

8.1  The  Update  Algorithm 

When  an  observer  moves  relative  to  a  scene,  each  scene  point  can  be  assigned  an 
instantaneous  velocity  in  space.  The  projection  of  this  velocity  field  into  the  image 
plane  of  the  observer  is  called  the  optical  flow  and  can  be  represented  by  a  vector 
{uij^Vij)  at  every  pixel  location  (i,  j).  The  relationship  between  the  optical  flow,  the 
motion  t  =  [17,  V,  W]^,  u>  =  [A,B,C]^  and  the  inverse  depth  d  =  IjZ  is  given  by 
the  Longuet-Higgins/Prazdny  formulas  [59] 

Uij  =  {-U  +  XjW)dij AxjUi  -  B{x] -\- \)  +  Cyi  (8.1) 

‘Vij  =  (-V'  -f-  yjW)dij  +  A{y]  +  1)  -  Bxjyi  -  Cxj  (8.2) 

=  {-V +  yiW)d  +  v:- 

where  (zj,  j/j)  is  the  physical  image  plane  location  of  pixel  (i,  j).  We  will  assume  that 
the  dimensions  of  the  optical  flow  fields  (u,j),  (uij)  and  the  disparity  field  (dtj)  are 
n  X  m. 

The  state  vector  x  is  constructed  by  concatenating  the  rows  of  the  disparity  field 
to  an  nm-dimensional  vector: 

Xim+j  =  dij  J  =  0,...,n  -  1;;  =  0,  ...,m  -  1  (8.3) 
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The  measurement  vector  y  is  constructed  by  combining  the  adjoined  rows  of  both 
the  (uij)  and  (vij)  fields  minus  the  rotational  components  (which  are  independent  of 
depth)  into  a  2n7n-dimensional  vector: 

Vim+j  =  Uij-u^-.  t  =  0, . . . ,  n  -  1  (8.4) 

yim+j+nm  ~  ^ij  V—  J  =  0,  1  (®'5) 

With  these  choices  we  can  write  the  motion  field  equations  (8.1),  (8.2)  for  all 
pixels  in  the  form  of  a  linear  Kalman  filter  measurement  (4.2)  y  =  Cx  where  C  is  a 
2nm  X  nm  matrix  with 

—  U  +  XjW  k  =  im  +  j 

-V -\-yiW  k  =  im-\-j-^nm  (8.6) 

0  otherwise 

Finally  we  must  consider  the  covariance  matrix  R  of  a  measurement  y.  We 

assume  that  an  optical  flow  measurement  (u,j,Ujj)  has  an  error  distribution  that  is 
Gaussian  with  covariance 

Rii  =  [  1  (8-7) 

where  p,  q  and  r  depend  on  the  algorithm  that  is  used  to  compute  the  optical  flow 
measurements  from  image  brightness  values.  An  example  of  such  an  algorithm  and 
the  resulting  covariance  values  is  given  in  section  8.2. 

The  measurement  covariance  matrix  R  can  now  be  constructed  by  collecting  all 
the  covariance  matrices  of  individual  optical  flow  measurements,  under  the  assump¬ 
tion  that  optical  flow  measurements  are  uncorrelated: 

Pij  k  =  im  -h  j,  I  =  im  j 

qij  k  =  im  +  j  +  nm,  I  =  im  -f  J  +  nm 

rij  k  =  im  +  j,l  =  im  j  +  nm  and  k  =  im  j  +  nm,  I  =  im  +  j 

0  otherwise 

(8.8) 

The  above  choices  for  the  state  x,  the  measurement  y,  its  covariance  R  and  the 
measurement  matrix  C  completely  determine  the  update  stage  of  the  Kalman  filter. 
By  plugging  these  values  into  the  update  equations  (4.16),  (4.17),  (4.18),  the  filter 
update  can  be  performed.  The  detailed  equations  which  are  obtained  after  these 
algebraic  manipulations  have  been  summarized  in  appendix  C  so  that  the  reader  can 
easily  implement  the  temporal  surface  reconstruction  from  optical  flow. 

Note  that  the  resulting  Kalman  filter  is  linear.  However,  the  Kalman  filter  con¬ 
vergence  and  optimality  properties  of  section  4.6  do  not  necessarily  apply  for  the 
following  reasons:  The  error  in  the  optical  flow  estimates  will  generally  not  have  a 
Gaussian  distribution.  Moreover,  neighboring  optical  flow  estimates  usually  exhibit 
correlation. 
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The  formulation  of  the  estimation  of  dense  depth  from  optical  flow  in  terms  of 
a  Kalman  Alter  was  first  done  by  Matthies,  Szeliski  and  Kanade  [67]  and  by  Heel 
[37]  as  a  special  case  of  the  above  presentation.  Both  assumed  non-correlation  as  the 
prior  model  (i.e.  all  non-diagonal  elements  of  S  are  zero).  In  this  case  each  disparity 
map  entry  (t,  j)  can  be  updated  independently  from  all  others 

5+  =  5-  -I-  a^p  -H  b^q  +  2abr  (8.9) 

x'*’  =  x~  -1-  [a(pu  -i-  ru  —  ad)  -f-  6(r'u  qv  —  bd)]/ S'^  (8.10) 

where  we  have  omitted  the  subscripts  i,j  and  have  abbreviated  a  =  U  +  XjW  and 
6=1^-!-  piW.  Consequently,  the  iterative  relaxation  algorithm  is  avoided  and  only 
a  small  number  of  multiplies  and  adds  are  required  at  each  pixel.  Both  formulations 
then  tried  to  impose  the  surface  smoothness  constraint  by  explicitly  smoothing  the 
depth  map  after  the  update  stage  of  the  Alter. 


8.2  Computation  of  Optical  Flow  and  its  Covari¬ 
ance 

The  temporal  surface  reconstruction  from  optical  flow  requires  that  the  optical  flow 
fleld  has  be  calculated  at  every  pixel  Algorithms  that  accomplish  this 

task  have  been  studied  in  detail:  The  differential  method  by  Horn  and  Schunck  [47], 
the  edge-based  approach  of  Hildreth  [42],  Nagel  and  Enkelmann’s  enhanced  second- 
order  differential  approach  [71],  Heeger’s  spatio-temporal  Alters  [33],  and  Anandan’s 
correlation-based  scheme  [2]  are  prominent  examples.  The  recursive  estimator  for 
temporal  surface  reconstruction  does  not  require  the  use  of  any  one  particular  algo¬ 
rithm.  However,  the  optical  flow  algorithm  employed  will  determine  the  uncertainty 
in  the  measurement  provided  to  the  Kalman  Alter  and  hence  the  covariance  matrix 
R.  For  this  reason  and  for  the  sake  of  providing  an  example,  a  matching  optical  flow 
algorithm  similar  to  the  one  by  Anandan  is  described  here  and  the  derivation  of  the 
measurement  covariance  R  is  given. 

Let  us  begin  by  recalling  that  the  optical  flow  is  an  estimate  of  the  projection 
of  the  3-D  velocity  fleld  into  the  image  plane.  An  optical  flow  fleld  will  therefore 
contain  a  vector  (Ujj,u,j)  for  every  point  P  projected  into  the  image  describing  which 
point  P'  it  projects  to  in  the  next  image. 

The  main  idea  that  is  exploited  in  the  sum-of- squared  differences  (SSD)  optical 
flow  is  depicted  in  Agure  8.1.  For  a  given  point  P  =  (i,y)  in  image  1  we  wish  to 
determine  where  it  moves  to  in  image  2.  We  assume  two  things: 

•  The  interframe  displacements  do  not  exceed  a  certain  number  of  pixels  in  each 
dimension. 
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Figure  8.1:  Computation  of  SSD  optical  flow 

•  The  brightness  in  an  area  surrounding  the  point  remedns  approximately  un¬ 
changed  by  the  motion. 

The  first  assumption  leads  to  the  concept  of  a  search  area  which  is  an  area  of  pixels 
in  image  2  that  we  will  consider  as  possible  correspondences  for  P.  The  second 
assumption  introduces  the  sum  of  squared  differences.  As  a  measure  of  how  well  P 
corresponds  to  each  candidate  point  P'  in  image  2  we  will  use  the  difference  between 
a  neighborhood  surrounding  P  and  the  corresponding  neighborhood  around  P'.  The 
measure  is  computed  as  the  sum  of  the  squared  differences  of  corresponding  pixel 
values  over  the  entire  neighborhood.  More  formally,  if  we  let  Ei{i,j)  denote  the 
brightness  value  at  location  P  =  (i,j)  in  image  1  and  Eiiiyj)  a  brightness  value  in 
image  2  then  for  every  P  =  (z,  j)  in  image  1  we  seek  a  P'  =  (i  -f  v,  j  4-  tt)  in  image  2 
such  that 

min  SSD{u,v)  =  min  V!  +  0  —  +  ^  +  +  w  -t-  /)]’.  (8.11) 

u,vtS  u,veS 


N  is  called  the  neighborhood,  5  is  referred  to  as  the  search-area.  The  displacement 
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(u,v)  is  called  the  optical  flow  vector  at  point  P.  ^ 

The  displacement  (u,v)  that  minimizes  (8.11)  is  determined  only  to  pixel  accu¬ 
racy.  It  can  be  improved  to  sub-pixel  accuracy  as  follows:  consider  the  surface  of  the 
sum-of-squared  differences  SSD{u^v).  In  searching  for  the  minimum  of  (8.11),  we 
have  calculated  samples  of  this  surface  at  the  integral  grid  point  locations  within  the 
search  area  5.  Suppose  that  (u,v)  was  the  integral  displacement  found  to  yield  the 
minimal  vadue  of  the  SSD  function.  The  true  sub-pixel  displacement  is  the  minimum 
of  the  continuous  SSD  surface  and  is  located  between  (u,  v)  and  the  neighboring 
integral  displacements  u  —  l,u  -|- 1  and  v  —  l,v  -I- 1  respectively.  We  will  determine 
the  minimum  of  the  continuous  SSD-surface,  by  fitting  a  quadratic  function  to  the 
minimum  SSD  and  its  neighbors  and  then  analytically  determining  its  minimum. 

We  fit  a  one- dimensional  quadratic  function  to  the  samples  in  both  the  u  and  v 
dimensions  respectively.  ^  An  example  of  an  SSD-surface  with  the  neighborhoods 
indicated  as  well  as  the  corresponding  interpolation  in  the  u  direction  is  shown  in 
figure  8.2. 
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Figure  8.2:  Subpixel  interpolation  of  the  optical  flow 


'The  displacement  (u,v)  that  minimizes  (8.11)  is  in  terms  of  indices  in  the  image  array.  The 
optical  flow  in  terms  of  physical  distances  is  easily  obtained  by  multiplication  with  the  pixel  distance 
(A*,  Ay). 

^We  could  flt  a  two-dimensional  quadratic  to  the  sample  points  in  the  immediate  neighborhood, 
but  this  surface  may  have  a  minimum  outside  the  neighborhood  and  may  therefore  not  always  be 
practical. 
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More  formally  we  fit  the  quadratic 

a(i)  =  ax^  +  6®  +  c  (8.12) 

to  the  three  samples  s_i,  so,  <Si  where  the  origin  is  located  at  the  center  of  the 
neighborhood  (x  stands  for  either  u  or  v  dimension  here).  The  coefficients  of  the 
best-fitting  surface  are 

0-  =  ^(^1  +  «-i)  - -»o  (8.13) 

b  =  i(ji-«-i)  (8.14) 

c  =  sq.  (8.15) 

Once  these  coefficients  have  been  determined  we  find  that  the  subpixel  minimum  of 

the  surface  is  located  at 

x'  =  ~b/2a  (8.16) 

and  the  value  there  is 

s{x')  —  c  —  I?  jAa.  (8.17) 

From  our  formulation,  x'  is  a  value  between  —1  and  1  and  is  simply  added  to  the 
integral  value  of  the  optical  flow  determined  from  (8.11). 

In  their  work,  Matthies,  Szeliski  and  Kanade  [65]  give  an  elegant  derivation  of 
the  variance  in  an  optical  flow  estimate  obtained  by  the  SSD  method.  Their  method 
was  restricted  to  translational  camera  motions  perpendicular  to  the  optical  axis,  so 
that  optical  flow  would  have  only  a  u  component.  As  a  consequence,  the  derivation 
of  the  variance  was  also  restricted  to  this  case,  but  can  be  extended  to  the  general 
case  presented  here  as  follows. 

We  model  the  images  and  Ei{i,j)  as  originating  from  a  single  noise- 

free  image  with  additive,  uncorrelated  Gaussian  noise  of  variance  (t\;,  where  E2  is 
displaced  by  (u,w)  at  location  (t,j)  with  respect  to  Ei 

Ei{iJ)  =  E{i,j) +  ni{i,j)  and  Ejii  +  v,j  +  u)  =  E{i,j)  +  n2{i,j)  (8.18) 

Now  the  expression  for  the  SSD  error  (8.11)  becomes  ^ 

SSD{u,v)  =  ^k,lcN  [E{i  +  k,j  +  1}  -  E{i  +  V  -  V k,j -i- u  -  u -h  1} -i- 

ni{i  +  k,j  +  /)  -  712(1  +  k,j  +  /)]*.  (8.19) 

By  Taylor  series  expansion  of  E  around  {u,v)  and  retaining  the  linear  terms,  we  have 

SSD{u,  v)  =  a(u  —  u)^  +  b{u  —  u){v  —  v)-\-c{v  —  v)^  +  d{u  —  u)  +  e{v  —  v)  +  f  (8.20) 

*As  is  pointed  out  in  (65),  the  last  term  is  actually  n2(»  +v-v  +  k,j  +  u-u  +  l)  but  can  safely 
be  approximated  as  is  done  here. 
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where 


a  = 

H  -Ex(i  +  k,j  +  1) 

kJiN 

(8.21) 

6  = 

2  53  Ex{i  +  k,j  +  l)Ey(i  +  kj  +  0 

k,leN 

(8.22) 

c  = 

53  Eiii  +  k,j  +  0 

k,t€N 

(8.23) 

d  = 

2  ^3  ^x(*  k,j  +  f)(ni{t  k,j  1)  —  713(1  -f  k,j-\- 1)) 

kJeN 

(8.24) 

e  = 

2  ^v(*  +  ^^3  +  j  +  0  -  +  0) 

k,l€N 

(8.25) 

f  = 

53  (ni(i  +  k,j  +  /)  -  713(1  +  k,j  +  /))*. 

(8.26) 

kdeN 


Ex  and  Ey  denote  a  suitable  discrete  approximation  to  the  partial  derivatives  of  E 
in  the  x  and  y  directions. 

The  minimum  of  the  quadratic  error  function  (8.20)  is  located  at 


u  —  u 

1 

be  —  2cd  1 

V  —  V 

4ac  —  6* 

1 

(8.27) 


For  the  computation  of  the  covariance  matrix  of  this  estimate  we  note  ^hat  only  the 
quantities  d  and  e  contain  the  noise  processes  and  are  stochastic  while  o,  b  and  c  are 
deterministic.  From  (8.24),  (8.25)  we  compute  the  covariances 


<^d  - 

cr*  =  8(t|.c 
cov{d,  e)  =  4(7^6 

and  from  there  the  covariance  matrix  of  an  optical  flow  vector 


P  r 

b 

1 

2c(4ac  —  b^) 

b^ 

r  q 

4ac  —  6^)^ 

b^ 

2a(4ac  —  6*) 

(8.28) 

(8.29) 

(8.30) 


(8.31) 


is  readily  obtained.  This  covariance  matrix  is  used  in  the  construction  (8.8)  of  the 
update  stage  for  the  temporal  surface  reconstruction  from  optical  flow. 

Note  that  the  derivation  of  the  measurement  covariance  described  here  is  specific 
to  the  SSD  optical  flow  estimation  algorithm.  However,  the  procedure  of  propagating 
the  noise  in  the  brightness  measurements  through  the  equations  of  the  optical  flow 
algorithm  can  be  applied  to  other  methods  as  well. 
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8.3  Experimental  Evaluation 

In  this  section  I  will  present  the  results  of  experiments  with  the  temporal  surface 
reconstruction  algorithm  using  optical  flow.  In  both  cases,  a  CCD  camera  with  a 
focal  length  of  10  mm  translated  relative  to  the  scene. 

8.3.1  Bottle  Experiment 

In  the  first  experiment  (first  presented  in  [37])  a  the  camera  translates  vertic^dly  over 
a  scene  consisting  of  a  small  spray  bottle  on  a  table  before  a  fiat  background.  The 
bottle  was  730  mm  away  from  the  camera,  the  background  was  1000  mm  away.  The 
camera  translated  5  mm  between  each  of  the  7  frames  in  the  sequence. 


Figure  8.3:  The  first  two  images  the  bottle  sequence. 

Figure  8.3  shows  the  first  two  images  from  the  bottle  sequence.  Figure  8.4  shows 
the  optical  flow  fields  computed  from  the  first  two  image  pairs  of  the  bottle  sequence 
using  the  matching  optical  flow  algorithm  described  above  in  section  8.2.  Variance 
data  was  also  computed  as  described  in  that  section.  From  the  optical  flow,  we  can 
see  the  translational  motion  of  the  camera  in  a  vertical  direction. 

This  data  is  used  in  the  simplified  reconstruction  scheme  (8.9)  in  which  non¬ 
correlation  of  pixels  is  assumed  as  a  prior  model  and  a  separate  binomial  filtering  step 
is  performed  between  update  and  prediction  stage  of  the  filter  to  enforce  smoothness. 
The  initizd  depth  map  is  flat  with  a  value  of  900  mm.  The  reconstructed  structure 
of  the  scene  after  every  iteration  of  the  algorithm  is  shown  as  a  wire  frame  in  figure 
8.5  from  left  to  right  and  top  to  bottom. 


8.3:  Experimental  Evaluation 
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Figure  8.4:  The  first  two  optical  flow  fields  from  the  bottle  sequence. 

The  run-time  of  the  filtering  algorithm  was  approximately  1  minute  per  &ame 
for  the  300  by  300  image  on  a  Sun  3/60  if  the  optical  flow  has  been  precomputed. 

8.3.2  Pepsi  Experiment 

In  a  second  experiment,  a  soda  can  was  placed  on  a  table  at  570  mm,  before  a 
background  parallel  to  the  image  plane  at  1240  mm.  The  sequence  of  frames  taken 
by  the  camera  is  shown  in  figure  8.6.  The  camera  translated  t  =  [1.5, 0,0]  mm 
between  frames. 

From  this  image  sequence  a  sequence  of  optical  flow  fields  was  precomputed  using 
the  matching  algorithm  described  previously  in  section  8.2.  The  first  three  optical 
flow  fields  are  shown  in  figure  8.7  and  the  translational  motion  is  evident  to  the 
human  observer.  Note  that  the  optical  flow  fields  are  rather  noisy,  in  particular  in 
the  image  regions  of  fairly  uniform  brightness,  as  no  smoothness  constraint  has  been 
imposed. 

The  disparity  map  state  vector  was  initialized  to  a  value  of  1/Zo  =  1/1000  ev¬ 
erywhere.  The  Kalman  filter  temporal  surface  reconstruction  scheme  was  applied  to 
the  optical  flow  data  using  the  thin  plate  prior  model  of  surface  smoothness.  Fourty 
Gauss-Seidel  iterations  were  used  at  each  time  step.  With  these  parameters,  each 
temporal  iteration  takes  about  30  seconds  on  a  Sun  Sparcstation  I  if  optical  flow  has 
been  precomputed  for  the  200  by  200  images.  Figure  8.8  shows  a  perspective  wire¬ 
frame  rendering  of  the  three-dimensional  structure  recovered  using  this  algorithm 
after  each  iteration.  A  closeup  of  the  final  result  is  shown  in  figure  8.9.  It  is  note- 
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worthy,  that  considerable  error  remains  even  after  10  iterations  and  it  is  particularly 
prominent  where  the  input  optical  flow  contained  errors.  This  is  simply  a  reflection 
of  the  fact  that  image  areas  of  nearly  uniform  brightness  allow  little  information 
about  structure  to  be  estimated  from  motion. 


8.3:  Experimental  Evaluation 
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Figure  8.5:  The  structure  recovered  &om  the  bottle  sequence  after  each  iveration  of 
the  Kalman  filter  depth  from  motion  algorithm  from  left  to  right  and  top  to  bottom. 


Figure  8.6:  The  first  9  images  from  the  pepsi  sequence  from  left  to  right  ^tud  top  to 
bottom. 
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Figure  8.7:  The  first  three  optical  flow  fields  from  the  pepsi  experiment. 
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Figure  8.8:  Wire  frame  renderings  of  the  structure  recovered  after  each  of  the  first  9 
iterations  of  the  temporal  surface  reconstruction  algorithm  from  the  optical  flow  of 
the  pepsi  sequence. 
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Figure  8.9:  A  closer  look  at  the  structure  recovered  after  the  9th  iteration  of  the 
temporal  structure  estimator  using  optical  flow  on  the  pepsi  sequence 
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Chapter  9 


In  this  chapter  we  will  explore  the  application  of  the  temporal  surface  reconstruc¬ 
tion  algorithm  to  the  problem  of  estimating  depth  from  shading.  A  corresponding 
instantaneous  2dgorithm  that  accomplishes  this  task  when  given  a  single  image  was 
described  in  section  2.2.9.  Since  only  the  update  stage  of  the  recursive  estimator  is 
dependent  on  the  visual  mechanism,  the  focus  in  this  chapter  is  exclusively  on  that 
part  of  the  algorithm. 

9.1  The  Update  Algorithm 

In  shape  from  shading  [46]  it  is  assumed  that  the  image  brightness  E  observed  at  a 
location  on  the  surface  is  a  known  function  of  the  surface  normal  [— p,  — g,  1] 

E  =  R{p,q).  (9.1) 

where  R  is  called  the  reflectance  function.  Reflectance  functions  have  been  deter¬ 
mined  for  a  number  of  surfaces  such  as  the  lunar  surface  and  Lambertian  surfaces. 
By  noting  that  p  =  and  q  =  Zy  (the  partial  derivatives  of  depth)  and  choosing  a 
discrete  approximation  we  obtain 

2Ax  ’  2Ay  ^ 

This  formula  can  be  used  as  the  nonlinear  measurement  equation  y  =  f(x)  (4.8)  if 
we  construct  the  state  x  by  concatenating  the  rows  of  the  depth  map 

Xjm+j  ~  Zxj  1  =  0,  ...,n  1»J  —  0,...,  7Tl  —  1  (9.3) 

and  the  measurement  vector  y  by  concatenating  the  rows  of  the  image  brightness 
fleld 

yim+j  -  Eij  i  =  0,...,n  -  1;  j  =  0,...,m  -  1.  (9.4) 


(9.2) 
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The  matrix  C  =  ^  is  found  to  be 

’  Rp/2Ax 

k  =  tm  -f  j '  +  1 

-Rpl2Ax 

k  =  im  -f  j  —  1 

Cim+j,k  —  ' 

Rj2Ay 

A:  =  (*  -1-  l)m  +  j 

(9.5) 

-Rgl2Ay 

k  =  li-  l)Tn  -1-  j 

[  0 

otherwise 

where  iZp  and  Rg  denote  the  partials  of  R  with  respect  to  p  and  q. 
Finally,  the  covariance  of  the  measurement  is 


R  =  41  (9.6) 

if  we  assume  brightness  measurements  to  be  uncorrelated  at  different  pixels  and 
identically  distributed  with  variance  4- 

Special  thought  must  be  given  to  the  filter  initialization,  as  the  measurement 
9.2  and  therefore  the  filter  update  are  a  function  of  derivatives  of  the  depth  Z. 
Initializing  the  state  vector  to  a  constant  value  as  suggested  in  chapter  6  is  useless 
in  this  case,  as  it  constitutes  a  local  minimum  in  the  solution  space  in  which  the 
iterative  update  scheme  (5.8)  becomes  stuck.  The  solution  is  to  initialize  the  depth 
to  random  numbers  which  will  guarantee  that  the  derivatives  p,  q  will  be  non-zero. 

As  with  all  shape  from  shading  algorithms,  convergence  and  local  minima  are  a 
particular  difficulty.  The  implementation  must  take  into  account  integrability  of  the 
derivatives  and  boundary  conditions  among  others.  In  dealing  with  these  problems, 
I  have  followed  the  suggestions  of  Horn  [45];  the  paper  contains  a  careful  discussion 
of  implementation  details  for  shape  from  shading  algorithms. 

A  noteworthy  point  is  the  fact  that  conventional  shape  from  shading  is  formulated 
for  orthographic  projections  while  the  prediction  stage  of  the  temporal  reconstruction 
scheme  assumes  perspective  projection.  It  is  possible  to  reformulate  the  shape  from 
shading  algorithms  for  perspective  projection  at  the  cost  of  increased  mathematical 
complexity.  In  my  implementation,  I  instead  reformiilated  the  prediction  stage  of 
the  reconstruction  scheme  to  work  with  orthographic  projection. 


9.2  Experimental  Evaluation 

The  implementation  of  this  Kalman  filter  for  structure  from  shading  (first  presented 
in  Heel  [41])  uses  the  thin  plate  (6.1)  as  the  prior  model. 


9.2.1  Sphere  Experiment 

For  this  experiment  a  synthetic  image  of  a  semi-sphere  on  a  planar  background  was 
created  using  the  Lambertian  shading  model.  Brightness  values  are  in  the  range  [0, 1] 
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and  Gaussian  noise  of  variance  0.05  was  added.  The  camera  translated  ujy  in 
both  in  both  x  and  y  directions. 

Figure  9.1  shows  the  first  8  frames  from  the  sequence  of  sphere  images.  Note 
that  these  images  are  only  50  by  50  pixels  in  size  and  have  been  enlarged  better 
inspection.  Note  also  the  visible  effect  of  noise  in  the  images. 

The  temporal  surface  reconstruction  scheme  was  applied  to  the  spl  e  .equence 
with  natural  boundary  conditions  imposed.  Fifty  Gauss-Seidel  iterations  were  used 
per  frame.  Figure  9.2  shows  wire  frame  renderings  of  the  structure  obtained  after  each 
iteration  of  the  temporal  reconstruction  scheme.  Figure  9.3  compares  the  structure 
obtained  after  the  eighth  iteration  of  the  algorithm  with  the  ground  truth. 

Since  a  s}rnthetic  sequence  was  used,  the  ground  truth  structure  was  known  and 
can  be  compared  quantitatively  to  the  estimate.  Figure  9.4  shows  the  development 
of  the  root  mean  squared  error  of  the  estimate  with  respect  to  the  ground  truth  as 
a  function  of  the  frame  number. 

The  convergence  behavior  of  the  temporal  depth  &om  shading  algorithm  is  of 
particular  interest.  I  compared  it  with  Horn’s  [45]  Height  and  Gradient  from  Shading 
on  a  single  frame.  Horn’s  algorithm  will  typically  require  about  1000  iterations  and  it 
may  diverge  for  some  illumination  directions  of  the  sphere.  The  temporal  algorithm 
will  diverge  in  the  same  cases,  however  when  it  converges,  it  reqmred  only  about 
50  iterations  per  frame.  This  is  of  course  mainly  due  to  the  fact,  that  initialization 
of  the  iterative  update  procedure  for  each  frame  uses  the  predicted  data  from  the 
previous  temporal  iteratian  while  the  single-frame  algorithm  ‘^starts  from  scratch”. 

When  noise  is  added,  as  in  the  above  experiment,  the  temporal  algorithm  con¬ 
verges  even  when  the  single-frame  algorithm  does  not  and  the  result  was  in  all  cases 
closer  to  the  ground  truth  than  the  single-frame  algorithm.  This  is  to  be  expected, 
as  the  temporal  algorithm  can  use  redundant  measurements  to  reduce  the  effect 
of  noisy  measurements  while  the  single-frame  algorithm  can  only  impose  a  surface 
smoothness  constraint  to  eliminate  noise. 


9.2.2  Crater  Experiment 

In  this  experiment,  images  of  Mars  taken  by  the  Viking  orbiter  were  analyzed.  Figure 
9.5  shows  a  sequence  of  images  of  a  crater  on  the  surface  that  were  taken  from  a 
larger  image  to  simulate  a  translatory  motion  of  the  orbiter  (recall  that  we  are  using 
orthographic  projection).  A  vertical  translation  corresponding  to  about  one  pixel 
per  frame  was  used. 

The  temporal  surface  reconstruction  algorithm  was  applied  to  this  sequence  using 
fixed  boundary  constraints  (i.e.  the  boundary  was  constrained  to  a  constant)  and 
40  Gauss-Seidel  iterations  per  frame.  The  light  source  direction  was  estimated  at 
(P»>9i)  =  (0»10)  this  estimate  was  verified  by  evaluating  the  resulting  shading. 
The  wire  frame  rendering  of  the  recovered  structure  after  each  time  step  is  shown  in 
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Figure  9.2:  Wireframe  renderings  of  the  first  8  structure  estimates  for  the  sphere 
sequence. 
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Figure  9.4:  Root  mean  squared  error  of  depth  over  hrame  number. 

figure  9.6  from  a  viewpoint  located  just  above  the  surface.  A  closeup  of  the  result 
after  nine  iterations  is  shown  in  figure  9.7. 

Since  the  ground  truth  structure  is  not  known  in  this  case,  a  quantitative  evalu¬ 
ation  of  the  result  of  the  estimation  is  difficult.  A  method  commonly  used  in  shape 
from  shading  in  this  case  is  to  shade  the  reconstructed  surface  using  the  same  light 
source  direction  that  W2is  used  in  the  reconstruction  and  to  compare  the  resulting 
image  with  the  input  image.  As  Horn  points  out  in  [45],  it  is  also  imperative  to 
additionally  shade  the  surface  using  other  light  source  directions.  Figure  9.8  shows 
the  result  of  shading  the  structure  estimate  from  the  ninth  iteration  using  both  the 
light  source  direction  (0,10)  used  for  reconstruction  and  the  light  source  direction 
(0,-10)  that  illuminates  the  surface  from  ‘‘above”.  These  images  can  be  compared 
with  the  last  one  in  the  sequence  of  figure  9.1 
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Figure  9.5:  The  first  9  images  from  the  Mars  crater  sequence. 
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Figure  9.6:  Wireframe  rendering  of  the  structure  estimates  from  the  Mars  crater 
sequence  after  each  iteration. 


98 


Chapter  9:  Filter  Update:  Depth  from  Shading 


Figure  9.7:  A  closer  look  at  the  structure  recovered  after  the  9th  iteration  of  the 
temporal  structure  estimator  using  shading  on  the  the  crater  sequence 


9.2:  Experimental  Evaluation 
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Figure  9.8:  The  result  of  shading  the  structure  estimate  from  the  ninth  iteration  of 
the  temporal  reconstruction  scheme  with  the  original  light  source  direction  and  a 
light  source  positioned  on  the  opposite  side  of  the  crater. 


Filter  Update:  Direct  Depth  from 
Motion 

Chapter  10 


In  this  chapter  we  will  explore  the  application  of  the  temporal  surface  reconstruc¬ 
tion  algorithm  to  the  problem  of  estimating  depth  from  motion  without  the  use  of 
optical  flow.  A  corresponding  instantaneous  algorithm  that  accomplishes  this  task 
when  given  a  single  measurement  of  the  optical  flow  wzis  described  in  section  2.2.8. 
The  focus  in  this  chapter  is  exclusively  on  the  update  stage  of  the  recursive  estimator, 
as  it  is  the  only  part  that  is  dependent  on  the  visual  mechanism. 

10.1  The  Update  Algorithm 


Based  on  the  brightness  constancy  assumption 


(10.1) 


Horn,  Negahdaripour  and  Weldon  [75],  [48]  derived  a  relationship  that  links  image 
brightness  E  directly  to  motion  t,  u  and  disparity  d  and  obviates  the  need  for 
expensive  computation  of  the  optical  flow: 


(s  •  t)d -I- V  •  w -I- Ft  =  0  (10-2) 

where  s  =  [— F,,-Fy,iF,  -t-  yEy]^  and  v  =  [(1  +  y^)Ey  +  xyEg,-{\  +  i*)F»  - 
xyEy,yEx  —  xEy]'^.  Given  at  least  two  frames  E  and  the  motion  t,  u  we  can 
compute  the  partial  dertivatives  F,,  Fy,  Ft  and  hence  the  vectors  s,  v.  Then  d  is 
easily  computed. 

We  can  construct  a  state  vector  x  by  concatenating  all  the  rows  of  the  disparity 
map 

Xifn+j  —  ^ij  i  =  0,*..,n  —  Ijj  =0,...,  TTl  —  1  (10.3) 

The  measurement  vector  y  is  comprised  of  the  brightness  values  Eijk  from  two  se¬ 
quential  images  ^  =  0, 1,  so  that  spatial  and  temporal  brightness  derivatives  can  be 
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computed: 


yfcmn+*m+j 


i  =  0,  ...,n  —  =  0,  ...,m  —  l;fc  =  0, 1  .  (10.4) 


For  the  sake  of  simplicity  we  will  assume  a  simple  discrete  derivative  operator  that 
can  easily  be  replaced  with  a  more  elaborate  one: 


r.  /  •  -x  - -Em-I.O 

-  - iTx - 

-  - iTy - 

F  (i  I'i 

- 


(10.5) 

(10.6) 
(10.7) 


where  Az,  Ay  denote  the  pixel  distances  and  At  is  the  time  between  frames. 

With  these  choices  we  see  that  the  measurement  equation  (10.2)  is  of  the  implicit 
type  (4.12)  g(x,y)  =  0.  The  matrix  C  =  dg/dx  is  diagonal  with 


-  Ey{i,i)V  +  {xE^{i,j)  +  yEy{i,j))W. 


(10.8) 


The  matrix  D  =  dg/dy  is  sparse  and  banded  with  6  bands: 

-  j^((-t7  +  xW)dij  +  Axy  -  B{1  +  i’)  +  Cy)  /  =  im  +  j  -  1 

5^((-C/  +  xW)dij  +  Axy  -  5(1  +  i*)  +  Cy)  /  =  tm  +  j  +  1 

-  +  ^(1  +  y")  -  -  C'*)  1  =  (t  -l)m+j 

Dim+j,i  =  -  +  xW)dij  +  >1(1  -I-  y^)  -  Bxy  -  Cx)  I  =  {i  +  l)m  +  j 

—h  l  =  im-\-j 

i  I  =  im  +  j  +  nm 

0  otherwise 

(10.9) 

If  we  assume  our  measurement  values  E  to  be  identically  distributed  Gaussian 
random  variables  with  variance  <tI;  then  the  measurement  covariance  is  given  by 

R  =  41 

The  above  choices  for  the  state  x,  the  measurement  y,  the  measurement  matrices 
C  and  D  as  well  as  the  measurement  covariance  R  completely  determine  the  update 
stage  of  the  implicit  Kalman  filter  and  the  update  can  be  preformed  by  plugging 
these  values  into  the  equations  (4.16),  (4.17)  (4.18).  ^ 


'Note  that  the  update  equations  are  slightly  modified  in  the  case  of  the  implicit  filter  as  described 
in  section  4.4. 


10.2:  Alternative  Formulation  of  the  Update  Algorithm 
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10.2  Alternative  Formulation  of  the  Update  Al¬ 
gorithm 

In  designing  the  Ka.lma.Ti  filter  the  choice  of  the  depth  as  the  state  and  the  bright¬ 
ness  as  the  measurement  in  the  above  formulation,  leads  to  the  implicit  nonlinear 
filter.  Other  choices  are  possible  and  may  lead  to  other  properties  of  the  estimation 
algorithm.  One  such  alternative  is  presented  here. 

If  we  solve  the  brightness  constancy  constraint  (10.2)  for  the  depth  Z  =  Ijd 


Z  = 


s  •  t 
V  •  u>  -f- 


(10.10) 


we  can  precompute  an  estimate  of  the  depth  map  values.  Now  we  designate  the 
depth  values  Z,j  to  be  not  only  the  state  x 


®tm+j  —  ^  —  0,...,TI  l;ji  —  0,  ...,Tn  1 

but  also  the  measurement  y 

yim+j  ~  I  “  0, . . . ,  n  1>  J  —  0>  •  •  •  >  ^  !• 


(10.11) 

(10.12) 


Now  the  relationship  between  state  and  measurement  is  given  by  the  measurement 
matrix  C  which  is  simply  the  identity  matrix.  Through  this  choice  of  the  measure¬ 
ment  vector  the  filter  actually  becomes  linear  (4.4)! 

The  disadvantage  of  this  formulation  is  that  the  covariance  R  of  the  measurement 
is  more  complex.  To  obtain  an  estimate  of  the  variance  in  a  given  depth  value  Z  we 
will  propagate  the  noise  in  the  brightness  measurements  E  through  equation  (10.10). 
We  assume  that  the  brightness  E  at  every  pixel  is  corrupted  by  Gaussian  noise  n  of 
variance  ol;  that  is  identically  distributed  at  every  pixel  and  mutually  uncorrelated 
between  pixels. 

First  we  express  the  depth  Z  from  (10.10)  explicitly  in  terms  of  the  brightness 


derivatives: 

^  aEx  +  hEy 

Et  -l-  cEx  -l"  dEy 

(10.13) 

where 

a  =  fU-xW 

(10.14) 

b  =  fV-yW 

^  (10.15) 

c  =  {xyA  -  {f^  +  x^)B)f f +  yC 

(10.16) 

d  =  ((/’  -f-  y^)A  -  xyB)/f  -  xC 

(10.17) 

Recall  that  /  is  the  focal  length,  t  =  [U,  V,  and  =  [A,  B,  . 
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If  the  discrete  approximations  for  the  derivatives  Ex,Ey,Et  suggested  by  Horn 
[44]  are  used,  eight  brightness  values  ,  t  =  1 , . . .  8  at  the  corners  of  a  spatio-temporal 
cube  contribute  to  the  value  of  Z  in  (10.13) 

Z  =  f{Eu...,Es)  (10.18) 


Under  the  assumption  that  the  nonlinear  function  /  can  be  locally  approximated  by 
the  first-order  terms  of  its  Taylor  series,  the  depth  variance  is  given  by 


,2  _  ^2  \a 


(10.19) 


If  we  apply  this  formula  to  our  expression  (10.13)  for  the  depth,  the  variance  is 
found  to  be 

2  _  ^  ^  \a/  \2  ,  /_L\2/.££.\2\ 

where 


dZ  _  aEt  +  {ad  —  bc)Ey 
dEx  {El  +  cEx  +  dEy)^ 
dZ  _  bEt  -  {ad  -  bc)Ex 

dEy  ~  {El  +  CEx  +  dEyY 
dZ  aEx  *1'  bEy 

dEy  {El  -|-  CEx  +  dEy)^ 


(10.21) 

(10.22) 

(10.23) 


If  we  assume  that  depth  map  values  are  uncorrelated,  the  measurement  covariance 
R  is  diagonal 

Rim+j,im+j  =  (Tij.  (10.24) 

Actually,  with  the  discrete  derivative  operator  chosen  above,  the  depth  value  Zij  is 
correlated  with  its  eight-connected  neighbors.  The  correlation  can  be  computed  in 
the  same  way  as  the  variance  calculation  shown  here. 


10.3  Experimental  Evaluation 

This  section  shows  the  results  of  experiments  with  the  alternative  implementation 
of  the  temporal  reconstruction  scheme  for  direct  structure  from  motion.  The  same 
camera  and  experimental  setup  previously  described  in  section  8.3  were  used.  The 
pepsi  experiment  below  uses  the  same  sequence  of  images  introduced  in  that  section, 
so  that  an  immediate  comparison  of  the  two  structure  from  motion  techniques  is 
possible.  The  thin  plate  model  of  surface  smoothness  was  used  with  between  20  and 
50  Gauss-Seidel  iterations  per  frame. 


10.3:  Experimental  Evaluation 
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10.3.1  Wave  Experiment 


Figure  10.1:  The  wave  experiment  scene 

This  first  experiment  was  designed  to  evaluate  the  performance  of  the  temporal 
reconstruction  scheme  quantitatively.  A  simple  planar  scene  was  created  by  mounting 
a  poster  (to  obtain  the  desired  texture  for  motion)  on  a  wall  parallel  to  the  camera’s 
image  plane  at  a  distance  of  1000  mm.  The  camera  translated  by  t  =  [1.5, 0,3.0] 
mm  relative  to  the  surface.  The  last  of  a  sequence  of  10  frames  taken  by  this  camera 
is  show  in  figure  10.1  . 

The  recovered  structure  after  1  and  10  iterations  of  the  temporal  surface  recon¬ 
struction  algorithm  is  shown  in  figures  10.2.  Since  the  thin-plate  model  of  surface 
smoothness  favors  fronto- parallel  surfaces  and  would  therefore  lead  to  a  misrepresen¬ 
tation  of  the  noise  reduction  effect  achieved  by  temporal  integration,  it  was  turned 
off  in  this  experiment.  Figure  10.3  shows  the  development  of  the  root  mean  squared 
error  with  respect  to  the  ground  truth  as  a  function  of  the  frame  number. 


10.3.2  Pepsi  Experiment 

This  experiment  uses  identically  the  same  images  as  the  one  described  in  section  8.3. 
The  result  of  applying  the  temporal  reconstruction  algorithm  directly  to  this  motion 
sequence  as  opposed  to  computing  the  optical  flow  is  shown  in  figure  10.4.  A  closeup 
look  at  the  wireframe  rendering  of  the  structure  obtained  after  the  eighth  iteration 
is  shown  in  figure  10.5.  It  is  noteworthy  that  each  iteration  of  the  the  filter  takes 
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Figure  10.2:  Wireframe  renderings  of  the  structure  from  the  wave  scene  after  1  and 
10  iterations  of  the  filter. 

approximately  30  seconds  on  the  200  by  200  images.  This  is  the  same  time  required 
by  the  optical  flow  based  algorithm  with  the  difference,  that  in  the  latter  case,  the 
optical  flow  must  also  be  computed.  Depending  on  the  parameters  of  the  optical 
'  flow  algorithm,  this  may  require  up  to  20  minutes! 

10.3.3  Cup  Experiment 

This  experiment  was  designed  to  evaluate  the  performance  of  the  temporal  recon¬ 
struction  algorithm  on  a  complex  scene.  A  cup,  a  set  of  books  and  a  staple- remover 
were  placed  on  a  table  before  a  planar  background  on  which  a  poster  was  mounted. 
A  top  view  of  the  layout  of  the  experiment  is  shown  in  figure  10.6.  The  camera 
translated  t  =  [2,0,4]  mm  between  frames  to  acquire  the  sequence  of  images  shown 
in  figure  10.7. 

Figure  10.8  shows  wire-frame  renderings  of  the  structure  estimates  after  each 
one  of  the  temporal  iterations.  A  closeup  of  the  structure  after  the  ninth  iteration 
is  shown  in  figure  10.9.  Note  that  what  looks  like  a  woman’s  face  in  the  image  is 
acu  tally  a  poster  and  appears  flat  in  the  depth  map  while  a  small  spoon  in  the  cup 
that  is  barely  visible  in  the  image  shows  up  clearly  in  the  structure  rendering. 


10.3:  Experimental  Evaluation 
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Figure  10.3:  Development  of  root  mean  squared  depth  error  as  a  function  of  the 
frame  number. 
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Figure  10.4:  Wire  frame  renderings  of  the  structure  recovered  after  each  of  the  first  8 
iterations  of  the  temporal  surface  reconstruction  algorithm  from  the  pepsi  sequence. 


10.3:  Experimental  Evaluation 
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Figure  10.5:  A  closer  look  at  the  structure  recovered  after  the  8th  iteration  of  the 
temporal  structure  estimator  using  direct  motion  on  the  the  pepsi  sequence 
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Figure  10.6:  A  top  view  of  the  scene  layout  for  the  cup  experiment. 


lO.S:  Experimental  Evaluation 
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Figure  10.7:  The  first  9  images  from  the  cup  sequence. 
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Figure  10.8:  Wire  frame  rendering  of  the  structure  recovered  from  the  cup  sequence 
after  each  temporal  iteration. 


lO.S:  Experimental  Evalxiaiion 
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Figure  10.9:  A  closer  look  at  the  structure  recovered  after  the  9th  iteration  of  the 
temporal  structure  estimator  using  direct  motion  on  the  the  cup  sequence 
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Chapter  11 


This  chapter  discusses  the  temporal  surface  reconstruction  scheme  from  several 
different  perspectives.  The  first  section  is  devoted  to  the  analysis  of  complexity  and 
run-time  of  the  algorithm.  The  second  investigates  the  possibility  of  implementing 
the  algorithm  on  a  parallel  processor  and  analyzes  the  complexity  of  such  an  imple¬ 
mentation.  The  third  section  summarizes  the  assumptions  and  approximations  made 
in  applying  recursive  estimation  to  the  temporal  surface  reconstruction  problem. 

11.1  Computational  Complexity  and  Run-Time 

To  analyze  the  serial  complexity  of  the  temporal  reconstruction  algorithm,  let  us 
consider  the  update  and  prediction  stages  separately  and  focus  on  a  single  iteration 
(time-step)  of  the  filtering  algorithm.  As  before  we  will  assume  that  the  image  and 
the  depth  map  have  dimensions  nxm.  The  state  and  measurement  vectors  x,  y  in  the 
algorithms  presented  in  this  thesis  contain  O(nTn)  values.  The  associated  covariance 
matrices  R,S  as  well  as  the  measurement  matrix  C  have  O(n^m^)  elements  but  due 
to  their  sparse  nature,  only  O(nTn)  are  non-zero. 

The  update  stage  (4.16),  (4.17),  (4.18)  requires  multiplications  and  additions  of 
the  above  vectors  and  matrices.  Due  to  the  sparse  and  banded  nature  of  the  matrices, 
this  can  be  accomplished  in  time  O(nTn).  In  addition,  the  matrix  S  must  be  inverted, 
which  is  achieved  by  the  iterative  Gauss-Seidel  process  (5.8).  One  iteration  of  this 
algorithm  requires  time  0(nm)  so  that  the  overall  complexity  of  the  update  stage 
is  0{knm)  where  k  is  the  number  of  iterations  used  in  the  Gauss-Seidel  process.  In 
our  experiments,  k  was  a  small  number,  usually  20  to  50. 

The  prediction  stage  requires  for  each  point  in  the  depth  map  to  be  warped 
and  then  the  depth  map  must  be  resampled.  The  warping  processes  each  depth 
map  entry  once  and  therefore  has  complexity  0{nm).  In  the  resampling  ^tep  each 
of  the  2(n  —  l)(Tn  —  1)  warped  triangular  surface  facets  could  in  the  worst  case  be 
intersected  by  all  of  the  nm  rays  through  grid  point  locations  leading  to  a  run  time  of 
O(n^m^).  This,  however,  is  clearly  a  degenerate  case  that  can  be  ruled  out  for  small 
motions  and  real  visual  surfaces  where  a  bounded  small  number  of  ray  intersections 
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per  triangulation  facet  can  be  safely  assumed  to  reduce  the  expected  run  time  to 
0(nm).  For  convex  surfaces,  for  example,  a  ray  can  intersect  the  surface  at  most 
two  times.  The  approximative  prediction  algorithm  presented  in  section  7.3  has  a 
worst-case  run-time  of  0{nm). 

In  summary,  the  worst-case  complexity  of  the  temporal  surface  reconstruction 
algorithm  is  0{n^m}).  For  aU  practical  purposes,  however,  the  expected  run-time 
is  only  0(nm)  per  time-step.  Note  that  this  complexity  for  a  single  measurement 
is  the  same  as  any  one  of  the  instantaneous  surface  reconstruction  procedures  from 
chapter  2. 

In  terms  of  the  actual  run-time  for  the  implementations,  the  following  results  were 
obtained.  The  implementations  were  done  on  a  Sun  SPARCstation  I.  On  an  image 
of  size  256  x  256,  one  iteration  of  the  temporal  surface  reconstruction  takes  between 
20  and  30  seconds  depending  on  the  visual  mechanism  used  for  the  filter  update. 
This  time  is  almost  evenly  divided  between  the  update  and  prediction  stages  where 
the  time  spent  in  the  update  stage  is  proportional  to  the  number  of  Gauss-Seidel 
iterations.  This  is  significant,  as  we  have  seen  in  the  depth-from- shading  example,  in 
comparing  the  run-time  with  the  repeated  application  of  an  instantaneous  procedure. 
In  the  latter  case,  the  Gauss-Seidel  algorithm  “starts  from  scratch”  for  each  new 
measurement  and  may  require  a  large  number  of  iterations  to  converge.  In  the  former 
case  of  temporal  surface  reconstruction,  the  Gauss-Seidel  process  in  the  update  stage 
can  be  initialized  with  the  predicted  depth  map  from  the  previous  time-step  and  will 
require  considerably  less  iterations. 

As  a  consequence,  for  all  practical  purposes,  the  temporal  surface  reconstruc¬ 
tion  algorithm  is  computationally  less  expensive  than  the  repeated  application  of  an 
instantaneous  surface  reconstruction. 


11.2  Parallel  Implementations 

The  temporal  surface  reconstruction  procedure  can  benefit  greatly  from  an  imple¬ 
mentation  on  a  parallel  processor  as  we  will  see  in  this  section.  We  will  investigate 
the  complexity  of  an  implementation  on  a  SIMD  processor  such  as  the  Connection 
Machine  (TM). 

The  update  stage  can  be  implemented  efficiently  by  arranging  the  processors  in 
a  two-dimensional  grid  and  assigning  one  processor  to  each  pixel  of  the  image/depth 
map.  Then  the  Jacobi  method  (see  Golub  and  Van  Loan  [30])  can  be  used  to  solve 
the  sparse  matrix  inversion  problem  (5.7).  The  update  of  a  given  pixel  reqtiires 
interaction  with  the  four- connected  neighbors.  By  our  arrangement  of  processors,  all 
values  of  the  depth  map  can  be  updated  at  once,  so  that  one  Jacobi  iteration  can 
be  computed  in  constant  time.  Therefore,  the  update  stage  of  the  temporal  surface 
reconsruction  takes  only  0{k)  time  where  k  is  the  number  of  Jacobi  iterations  used. 


11.2:  Parallel  Implementations 
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For  the  prediction  stage,  we  begin  with  the  same  arrangement  of  processors  as 
above.  Each  processor  computes  the  warping  (7.3)  of  the  point  corresponding  to  its 
depth  map  entry.  Obviously,  this  can  be  accomplished  in  a  single  time-step.  Now 
the  resulting  depth  value  must  be  propagated  to  the  new  location  (i,j)  in  the  depth 
map  for  the  resampling.  Assuming  that  each  processor  can  communicate  with  its 
four- connected  neighbors,  this  may  take  time  0(n  -I-  m)  (in  the  same  worst-case 
scenario  described  for  the  serial  case  above)  but  for  real  surfaces  and  small  motions 
this  propagation  will  extend  over  only  a  small  number  of  processors  (pixels).  After 
the  propagation,  each  processor  determines  its  new  depth  value  by  intersecting  a  ray 
through  its  location  with  the  triangle  facet  given  by  the  depth  values  that  have  been 
propagated  to  it  (or  by  the  simplified  weighted  scheme).  This  is  again  a  constant-time 
operation. 


In  summary,  a  parallel  SIMD  implementation  of  the  temporal  surface  reconstruc¬ 
tion  algorithm  will  require  computation  time  proportional  to  the  number  of  iterations 
k  used  in  the  update  stage.  I  am  aware  of  one  effort  to  actually  carry  out  such  an 
implementation  at  the  ETH  Zurich.  Most  importantly,  a  SIMD  implementation  such 
as  the  one  above  is  amenable  to  implementation  directly  on  a  single  chip.  Among  the 
existing  implementations  of  vision  algorithms  on  single  chips,  the  analog  VLSI  ap¬ 
proach  of  Mead  [69]  is  most  easily  extended  to  the  recursive  estimation  task  required 
for  temporal  surface  reconstruction.  The  "artificial  retina”  can  perform  relaxation 
algorithms  such  as  the  one  needed  for  the  update  procedure  of  the  Kalman  filter  in 
a  truly  time-continuous  fashion. 


Another  more  common  type  of  specialized  hardware  can  effectively  support  the 
temporal  surface  reconstruction  procedure.  Graphics  workstations  such  as  the  Silicon 
Graphics  (TM)  provide  a  fast  storage  called  the  Z-buffer  that  can  be  used  to  hold 
the  depth  map  state  vector  and  can  be  accessed  quickly.  The  resampling  step  in 
the  prediction  stage  is  equivalent  to  the  z-buffered  rendering  of  a  surface  given  as  a 
triangular  mesh.  This  is  a  common  operation  in  these  systems  and  is  supported  by 
special  hardware. 


Since  the  Kalman  filter  operations  are  linear,  it  is  not  surprising  that  the  nec¬ 
essary  computations  can  be  implemented  on  a  network-type  architecture.  Yeates 
[110]  describes  how  a  Kalman  filtering  algorithm  can  be  implemented  on  a^ully  con¬ 
nected  two-layer  network  by  having  the  network  use  Newton’s  algorithm  to  perform 
the  necessary  matrix  inversions.  Network  implementations  are  of  particular  interest 
because  of  the  high  speeds  that  can  be  achieved  due  to  massively-parallel  processing 
and  because  of  the  similarity  to  neuronal  computation  mechanisms  found  in  humans. 
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11.3  Assumptions  and  Approximations 

In  applying  the  Kalman  filter  to  the  temporal  surface  reconstruction  problem,  we 
have  made  a  number  of  assumptions  and  approximations.  In  some  cases,  these  sim¬ 
plifications  cause  some  of  the  useful  properties  of  the  filter  (see  section  4.6)  to  be 
lost.  Additionally,  it  is  important  to  understand  these  approximations  in  order  to 
evaluate  the  temporal  surface  reconstruction  scheme  and  its  limitations.  In  what  fol¬ 
lows,  the  assumptions  and  approximations  made  throughout  the  thesis  are  compiled, 
evaluated  and  possible  remedies  are  discussed. 

1.  Extended  Kalman  filter: 

Two  of  the  visual  mechanisms  investigated  in  this  thesis,  structure  from  shading 
in  chapter  9  and  the  first  approach  to  structure  from  motion  without  optical 
flow  10,  section  10.1  model  the  relationship  between  state  and  measurement 
as  a  non-linear  one.  This  necessitates  the  use  of  the  extended  Kalman  filter 
or  the  implicit  Kalman  filter.  From  section  4.6  we  recall  that  these  versions  of 
the  filter  are  only  approximative  and  convergence  is  not  guaranteed. 

In  some  cases,  it  is  possible  to  formulate  the  filter  in  the  desired  linear  form 
by  making  a  different  choice  for  the  measurement  vector  y  or  the  state  x.  An 
example  is  the  alternative  temporal  reconstruction  scheme  for  direct  structure 
from  motion  in  section  10.2.  By  choosing  the  measurement  vector  to  consist 
of  the  depth  Z  directly,  the  relationship  to  the  state  vector  became  linear 
and  trivial.  On  the  other  hand,  however,  the  derivation  of  the  measurement 
covariance  R  requires  another  Taylor  series  approximation.  In  summary:  we 
can  trade  off  non-linearities  in  the  measurement  relationship  for  non-linearities 
in  the  measurement  covariance. 

2.  Correlation  of  measurements: 

In  determining  the  covariance  matrix  R  of  the  measurement  vector,  we  have 
neglected  the  effect  of  correlation  between  elements  of  the  measurement  vector 
in  the  case  of  structure  from  motion  using  optical  flow  chapter  8  and  in  one  of 
the  cases  of  structure  from  motion  without  optical  flow  chapter  10,  section  10.2. 
This  approximation  simplifies  the  computation,  since  it  ignores  off-diagonal 
elements  of  the  matrix  R  and  makes  the  inversion  of  R  required  in  the  update 
stage  of  the  filter  (4.17)  much  easier.  However,  as  we  know  from  section  4.6, 
the  convergence  of  the  filter  is  no  longer  guaranteed  under  this  approximation. 

There  are  two  possible  solutions  to  this  problem.  The  first  is  to  make  a  different 
choice  for  the  elements  of  the  measurement  vector  y.  This  may  lead  to  a  much 
simpler  covariance  matrix  as  in  the  example  of  direct  structure  from  motion 
chapter  10,  section  10.1.  This  is  the  tradeoff  discussed  under  the  previous  item. 
The  second  solution  is  to  explicitly  compute  the  covariances  and  carry  them 


11.3:  Assumptions  and  Approximations 


119 


through  the  computation.  It  would  be  interesting  to  investigate,  whether  this 
additional  effort  results  in  significantly  improved  estimates. 

3.  Noise  distribution: 

The  optimality  of  the  linear  Kalman  filter  requires  the  noise  distribution  to  be 
Gaussian.  In  both  of  the  linear  filters  described  in  this  thesis  (structure  &om 
motion  using  optical  flow  chapter  8  and  structure  form  motion  without  optical 
flow  chapter  10,  section  10.2)  the  measurement  quantities  (optical  flow  and 
depth)  will  in  general  not  have  a  Gaussian  distribution,  so  that  the  optimality 
property  is  lost. 

As  mentioned  in  section  4.6,  the  Gaussianity  of  measurement  noise  is  neces¬ 
sary  to  show  optimality  of  the  Kalman  Alter  among  all  possible  estimators.  If 
a  Gaussian  noise  model  is  not  appbcable,  there  may  be  a  non-linear  estima¬ 
tor  that  outperforms  the  Kalman  Alter,  but  it  remains  optimal  among  linear 
estimators.  This  assumption  is  therefore  not  a  crucial  one. 

4.  Prediction  Resampling: 

The  resampling  algorithm  in  the  prediction  stage  (chapter  7)  approximates  the 
surface  as  being  planar  between  sampling  locations.  This  will  introduce  an  error 
into  the  depth  estimates,  the  magnitude  of  which  depends  on  the  frequency 
content  of  the  surface.  This  will  influence  the  convergence  property  of  the 
Alter,  in  particular  it  may  prevent  the  estimate  from  converging  completely  to 
the  ground  truth. 

It  is  possible  to  improve  the  prediction  stage  to  reduce  the  effect  of  this  approx¬ 
imation.  We  could,  for  example,  do  a  bilinear  or  bicubic  approximation  of  the 
surface  within  a  given  facet.  Such  an  approximation,  although  computationally 
more  expensive,  would  also  be  compatible  with  the  smoothness  assumption  on 
the  surface  structure.  Whether  or  not  a  more  accurate  prediction  stage  is  ap¬ 
propriate  depends  largely  on  the  accuracy  of  the  estimates  obtainable  with  a 
given  visual  mechanism.  The  experimental  results  in  this  thesis  indicate  that 
for  structure  from  motion,  the  errors  as  a  result  of  the  visual  mechanism  are 
far  larger  than  any  errors  introduced  by  approximations  within  the  Alcer  while 
in  structure  from  shading  this  may  not  be  the  case.  Note  that  a  small  decrease 
in  the  certainty  value  of  each  depth  estimai  is  used  to  explicitly  represent  the 
error  introduced  by  the  prediction  stage. 

5.  Certainty  Prediction: 

The  prediction  (4.20)  of  the  certjunty  matrix  S  will  in  general  not  preserve 
the  sparse  and  banded  nature  of  this  matrix.  Since  this  property  is  crucial 
to  maintain  computational  manageability,  the  change  in  off-diagonal  elements 
was  neglected  by  the  separation  in  equation  (6.8).  This  may  cause  the  Alter  to 
lose  its  optimality  and  even  its  convergence  property. 


120 


Chapter  11:  Features  and  Faults 


It  will  be  very  difRcrilt  to  relax  this  assumption,  since  the  computational  feasi¬ 
bility  depends  on  it.  There  are  two  arguments  to  support  this  approximation. 
First,  it  can  be  understood  as  a  viewpoint-independent  prior  model  of  surface 
structure  and  therefore  has  some  justification  in  a  physical  sense.  Second,  the 
experiments  show  that  smooth  surfaces  can  be  recovered  with  this  approxima¬ 
tion.  It  may  therefore  be  a  tolerable  one  for  practical  purposes. 

In  addition  to  these  assumptions  and  approximations  made  in  the  temporal  recon¬ 
struction  algorithm  itself,  there  are  two  other  sources  of  error  that  must  be  weighed 
carefully  against  the  ones  cited  above  in  deciding  where  to  begin  with  improvements. 

The  first  is  a  systematic  error  in  the  modeling  of  the  visual  mechanism.  In  the 
structure  from  shading  case  (chapter  9)  for  example,  the  relationship  between  image 
brightness  and  surface  gradients  is  only  approximately  modeled  by  the  reflectance 
function  (9.1)  and  does  not  follow  this  model  exactly  in  real  images.  In  the  case  of 
structure  from  motion  without  optical  flow,  the  relationship  between  brightness  gra¬ 
dients  and  depth  is  only  approximately  given  by  the  brightness  constancy  assumption 
(10.2).  These  systematic  errors  depend  on  the  particular  image  under  consideration 
and  can  far  outweigh  the  effect  of  the  approximations  made  in  constructing  the  filter 
algorithm. 

The  second  is  the  approximate  nature  of  the  prior  surface  model.  The  ground 
truth  surface  will  generally  not  obey  a  membrane  or  thin-plate  model  so  that  the 
smoothness  constraint  may  actually  drive  the  solution  away  from  the  true  values.  In 
particular,  real  surfaces  contain  depth  discontinuities  that  cannot  be  modeled  in  this 
simple  way  and  are  therefore  inaccurately  recovered.  While  there  exist  methods  for 
incorporating  discontinuities  into  the  prior  surface  model  (see  section  3.7),  they  are 
computationally  more  expensive  and  do  not  eliminate  the  adverse  effect  of  an  error 
in  the  prior  model. 

The  analysis  of  these  approximations  and  assumptions  leads  to  one  conclusion: 
theoretically  derived  properties  of  the  temporal  surface  reconstruction  scheme  will 
be  limited  in  their  practical  applicability.  Experimentation  must  show  how  useful 
the  method  is  for  a  given  visual  mechanism. 
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The  temporal  surface  reconstruction  method  presented  in  this  thesis  provides  a 
way  in  which  low-level  “instantaneous”  structure  estimation  procedures  for  various 
visusd  mechanisms  can  be  embedded  in  a  recursive  estimation  framework  and  thereby 
applied  continuously  to  a  sequence  of  frames.  In  the  introduction  chapter  1  we  listed 
three  requirements  that  a  time-continuous  structure  estimation  procedure  should 
meet.  Here  we  evaluate  the  temporal  surface  reconstruction  method  with  respect  to 
those  criteria: 

1.  Quality  improvement: 

The  update  stage  of  the  filter  incorporates  each  measurement  in  such  a  way, 
that  the  resulting  state  has  minimal  error  variance.  In  the  ideal  filter  the 
variance  is  guaranteed  to  always  decrease  over  time. 

2.  Motion  transformations: 

The  prediction  stage  of  the  Ksdman  filter  accounts  for  possible  camera  trans¬ 
formations  between  frames. 

3.  Uncertainty  representation: 

Uncertainty  is  represented  explicitly  by  covariance/certainty  matrices  and  the 
update  stage  that  combines  old  estimate  and  new  measurement  to  produce  a 
new  estimate  weights  its  inputs  with  their  covariances  to  take  this  uncertainty 
into  account. 

4.  Computational  simplicity: 

The  temporal  surface  reconstruction  runs  faster  than  the  repeated  application 
of  instantaneous  procedures  while  producing  estimates  of  higher  quality. 

Although  recursive  estimation  meets  ail  of  our  initial  requirements,  we  have  seen 
that  the  nonlinearity  of  some  visual  mechanisms  forces  us  to  make  assumptions  that 
may  cause  some  of  the  desirable  theoretical  properties  of  this  scheme  to  be  lost. 
Under  such  circumstances,  experimentation  plays  an  important  part  in  verifying  the 
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validity  of  assumptions  and  approximations  and  I  believe  the  results  presented  here 
to  be  at  least  encouraging. 

The  appealing  feature  of  this  approach  is  the  fact  that  it  provides  not  only  a 
uniform  theoretical  framework  for  temporal  surface  reconstruction  but  also  yields  al¬ 
gorithms  of  manageable  complexity  with  robust  results  on  real  images.  In  addition, 
it  constitutes  a  unifying  theoretical  framework  for  previous  work  in  surface  recon¬ 
struction  from  sequences  of  images.  Previous  work  on  incremental  estimation  of 
dense  structure  by  Matthies,  Szeliski  and  Kanade  (see  section  3.5)  and  Heel  (see  sec¬ 
tion  3.8)  can  be  understood  as  the  application  of  simplified  versions  of  the  temporal 
surface  reconstruction  scheme  to  specific  visual  mechanisms. 

Future  work  in  temporal  structure  estimation  will  proceed  along  five  major  lines: 

1.  Opportunities  for  relaxing  the  current  set  of  assumptions  and  approximations 
necessary  to  apply  recursive  estimation  to  the  temporal  surface  estimation 
problem  (see  section  11.3)  will  be  explored.  This  can  be  achieved,  for  ex¬ 
ample,  through  new  choices  for  the  measurement  vector,  improved  prediction 
schemes  and  enhanced  modeling  of  uncertainty.  Other  sources  of  error  such  as 
the  prior  surface  model  and  the  modeling  of  the  visual  mechanism  will  also  be 
the  subject  of  improvement  efforts. 

2.  Alternative  representations  for  structure  information  will  be  investigated  under 
the  recursive  estimation  framework.  One  example  is  the  voxel  representation 
or  the  octree  representation  (see  Szeliski  (95),  [94]).  They  can  help  overcome  a 
major  disadvantage  of  all  of  the  previous  representations:  they  can  represent 
structure  information  that  is  not  currently  in  view  of  the  camera  but  was 
actually  acquired  much  earlier  in  the  image  sequence  and  then  disappeared 
from  view.  A  second  example  are  feature-based  representations  and  how  they 
can  be  understood  as  a  sampling  of  a  dense  representation. 

3.  Other  visual  mechanisms  will  be  embedded  into  the  temporal  framework  and 
the  models  for  exisiting  visual  mechanisms  will  be  enhanced  so  that  a  refor¬ 
mulation  of  the  embedding  may  become  necessary.  In  particular,  the  problem 
of  sensor  fusion  in  which  structure  information  is  obtained  from  several  vi¬ 
sual  mechanisms  or  non- visual  sensors  can  be  solved  in  an  elegant  way  using 
the  temporal  reconstruction  mechanism  by  simply  expanding  the  measurement 
vector  y  to  include  all  the  measured  quantities. 

4.  The  temporal  reconstruction  scheme  will  find  its  way  into  impleinentations 
on  parallel  processors  and  special-purpose  hardware.  Ideally,  the  processing 
should  be  done  immediately  after  the  image  acquisition  and  possibly  on  the 
same  chip.  All  indications  are  that  this  task  is  feasible,  but  much  more  work 
is  needed. 
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5.  Ultimately,  the  temporal  reconstruction  scheme  can  only  prove  its  usefulness 
when  it  becomes  part  of  a  functioning  system.  This  will  require  that  the 
structure  information  produced  in  a  temporally  continuous  manner  is.  utilized 
by  some  other  algorithm  that  solves  a  particular  application  problem  such  as 
the  navigation  of  a  vehicle  or  the  identification  of  a  target.  The  navigation 
problem  specifically  is  the  subject  of  ongoing  research  by  the  author. 

Hopeful]}',  the  theoretical  and  experimental  results  in  this  thesis  will  provide  a 
basis  for  addressing  these  problems  and  contribute  to  a  better  understanding  of  the 
temporal  nature  of  visual  analysis. 
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In  this  appendix  I  will  derive  the  equations  for  the  implicit  Kalman  filter  intro¬ 
duced  in  section  4.3.  The  measurement  equation  of  a  dynamical  system  is  given 
by 

g(yfc-Vfc,Xfc)  =  0.  (A.l) 

where  v*  ~  ./V(0,Rfc).  We  postulate  a  linear  update 

=  K  +  Kfcg(^fc  ,y)  (A.2) 

where  K*  will  be  chosen  to  minimize  the  expected  length  of  the  error  vector  ej  = 
Xfc  —  Xfc.  With  the  help  of  the  state  covariance 

«■((':)(«:  )’■]  =  £!(»?  -  x,xsit  -  X*)''!  =  pt  (A.3) 

we  can  formulate  the  optimization  problem  as  follows:  determine  the  matrix  K*  that 
minimizes  J  =  trace{P^). 

Under  the  postulated  update  equation  {A.2)  the  estimation  error  is 

=  xj -Xfc  =Xfc  -l-Kfcg(x;,y)-Xfc.  (A.4) 

The  error  covariance  becomes 

p;  =  P;  +  £Kg(».-.y.)|Kr  +  K.£;|g(ic;,y)(e;)’'i+  (a.s) 

To  find  the  optimal  Kfc  we  differentiate  J  with  respect  to  Kfc  and  equate  the 
result  to  zero.  We  obtain 

Kfc  = -F[€fcg^(x;,yfc)](F(g(x;,y)g^(Xfc,y)])-*  (A.6) 

This  can  be  simplified  by  Taylor  series  expansion  of  g: 

g(x*,yfc  -  Vfc)  =  g(x;,yfc)  -i-  -  Xfc  )  -H  •  (A.7) 
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Recall  from  (A.l)  that  the  left  hand  side  is  zero  and  neglect  the  higher  order  terms 
to  obtain 


g(*i:,y*)  =  Cki±j;  -  Xfc)  +  DfcV*  (A.8) 


where 


Cfc  = 


and 


(A.9) 


Now  we  can  substitute  the  linear  approximation  (A.8)  into  the  expression  for  the 
gain  (A.6) 

K*  =  -P]fcnC*P;Cl  +  DiRiDf]-*  (A.IO) 


where  we  have  used  the  fact  that  =  0  and  )(«*  )^]  =  P*  •  The  lin¬ 

earized  approximation  also  simplifies  the  expression  ( A.5)  for  the  updated  covariance 


p:  =  (I  -  KfcCOPfc- 


(A.ll) 
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Figure  A.l:  State  estimate  and  variance  from  the  implicit  Kalman  filter  experiment. 

To  illustrate  the  operation  of  the  implicit  Kalman  Alter,  I  have  simulated  the 
system 

Xfc+i  =  Xk  (A. 12) 

xkVk  -f  C  =  0  ^  (A. 13) 

with  a  true  value  x  =  5,  the  constant  C  =  10  and  a  measurement  noise  of  standard 
deviation  <Ty  =  0.1  (2  percent).  The  filter  was  initialized  to  xo  =  4  and  run  for  500 
iterations.  The  resulting  state  and  variance  estimates  are  shown  in  figure  A.l. 
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The  prediction  stage  of  the  Kalman  filter  for  temporal  surface  reconstruction 
transforms  the  current  estimate  and  its  covariance  from  one  time-step  to  the  next 
and  thereby  accounts  for  interframe  motions  of  the  camera  with  respect  to  the  scene. 
Chapter  7  describes  in  detail  how  the  prediction  of  the  state  (depth  values)  can  be 
accomplished,  but  gives  only  a  simple  approximation  for  the  corresponding  trans¬ 
formation  of  the  covariances  in  section  7.2.  Although  this  approximation  is  used  in 
most  of  the  experiments  and  is  computationally  much  simpler,  it  is  possible  to  deter¬ 
mine  the  predicted  values  of  the  covariance/certainty  more  accurately  by  propagating 
them  through  the  depth  prediction  equations.  The  derivation  here  takes  advantage 
of  the  viewpoint-independent  model  of  surface  smoothness  and  therefore  only  treats 
the  diagonal  of  the  certainty  matrix  S  =  P“*.  Moreover,  since  most  results  about 
random  variables  are  in  terms  of  variances,  the  derivation  is  in  terms  of  variances  p 
which  are  the  inverses  of  the  diagonal  entries  in  S. 

It  remains  to  determine  the  variance  values  of  the  warped  depth  map.  In  essence, 
the  warped  values  of  Z  are  some  function  of  the  input  values  of  Z  i.e.  the  output 
value  is  a  random  variable  which  is  some  function  of  several  input  random  variables. 
More  formally:  we  interpret  the  given  values  of  Z(xj,y,)  as  normally  distributed 
random  variables  with  variances  p{xj,yi).  What  are  the  value  of  after  the 

warping? 

B.l  Variance  Propagation 

Let  us  first  establish  some  basic  facts  about  propagation  of  variances.  Let  Zi^Z^ 
be  two  uncorrelated  random  variables  with  variances  Then  the  random 

variable 

Z  =  aZi  -I-  6Z2  (B.l) 

has  the  variance 

O’!  =  -t-  b^o'z2  (®*2) 
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The  only  assumption  made  here  is  that  Zi^Zj  are  uncorrelated. 

In  the  more  general  case  Z  can  be  an  arbitrary  function  of  several  random  vari¬ 
ables,  say 

Z^fiZr.Z,).  (B.3) 

In  this  case  we  approximate  /  by  its  Taylor  series  around  the  point  of  interest  and 
neglect  all  but  the  first-order  terms.  Then  we  apply  the  above  rule  for  a  linear 
combination  of  random  variables  to  obtain 

where  the  dertivatives  must  be  evaluated  at  the  particular  point  (Zi,  Z})  of  interest. 
This  relationship  is  easily  extended  to  n  independent  variables.  The  assumptions 
made  here  are  zero  correlation  between  Zi  and  Z}  and  the  fact  that  /()  can  be 
approximated  by  the  Rrst  terms  of  its  Taylor  series  near  ZijZj. 


B.2  Variances  of  the  warped  depth  values 

Having  established  how  variances  propagate  through  functions  we  need  merely  deter¬ 
mine  the  functional  relationship  between  input  and  output  depth  values  and  propa¬ 
gate  the  variances  through  these  functions.  Let  us  determine  where  in  our  algorithm 
we  actually  compute  an  output  value  of  Z  and  how  it  is  computed. 

There  are  two  ways  in  which  an  output  value  of  Z  can  be  computed.  The  first 
is  by  resampling  in  step  3  of  the  algorithm.  There  we  compute  the  intersection  of 
the  ray  through  a  pixel  location  (x,y)  with  a  spatial  triangle.  The  spatial  triangle 
is  determined  by  the  3D  coordinates  of  the  corner  points  {Xl,Y/,Zl)  for  t  =  1,2,3. 
So  our  output  value  of  Z  is  some  function  (which  is  determined  by  the  interpolation 
procedure)  of  the  corner  coordinate  values.  Each  corner  coordinate  is  the  result  of 
warping  one  point  (Xi,Yi,Zi)  from  the  input  surface.  The  warping  function  is  simply 
a  linear  combination  of  the  input  point  coordinates.  Finally  we  note  that  the  X  and 
Y  components  of  each  original  point  are  obtained  by  inverse  perspective  projection 

Xi  =  ^  and  Yi  =  ^.  (B.5) 

Thus,  each  Z  value  obtained  in  step  3  of  the  algorithm  is  a  function  of  three  depth 
values  Zi,Z2,Z3  in  the  input  depth  map.  We  determine  this  functional  relationship 
below. 

The  second  possibility  is  when  the  depth  value  is  obtained  by  extraptdating  in 
step  4  of  the  algorithm.  In  this  case  a  depth  value  Z  is  computed  as  the  average  of 
some  depth  values  Zi,...Zk  in  its  immediate  neighborhood.  This  is  a  fairly  simple 
linear  relationship.  The  variance  propagation  for  this  case  is  also  discussed  in  detail 
below. 
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B.2.1  Variances  of  interpolated  depth  values 

We  determined  above  that  each  output  depth  value  Z  is  a  function  of  three  in¬ 
put  depth  values  Ziy  Zx^  Zx-  Let  us  begin  by  determining  the  relationship  between 
these  values  and  the  corresponding  values  {X[,Y-l,  Z[),  Zj),  (X3,  Y3,  Z^)  after 

warping.  We  combine  the  equations  of  motion 


\xn 

■  U  ■ 

‘  -1  -C  B  ' 

■  Xi  • 

=  — 

V 

— 

C  -1  -A 

Yi 

\ 

w 

eH 

1 

1 

_ 1 

(B.6) 


with  the  inverse  perspective  projection 


- 

^  ~  / 


to  obtain 

x;  =  -u  +  KiZi 

y:  =  -v  +  b^z, 

Z\  =  -W  +  b,iZi 

where 

h.,  =  x/f  +  Cy/f-B 
byi  =  -Cx! f  -Iryl f  +  A 
bti  =  Bxlf-Aylf+l. 


(B.7) 

(B.8) 


(B.9) 

(B.IO) 

(B.ll) 


(B.12) 

(B.13) 

(B.14) 


After  the  motion  warping,  we  compute  the  output  depth  by  intersecting  a  ray 
through  a  grid  point  with  a  spatial  triangle.  As  we  recall,  there  are  three  cases  by 
which  Z  can  be  obtained  within  the  interpolation  procedure. 

1.  There  is  exactly  one  point  of  intersection  between  the  ray  and  the  spatial 
triangle. 

2.  The  ray  lies  in  the  same  plane  as  the  spatial  triangle  and  it  has  at  most  one 
point  in  common  with  each  edge  of  the  spatial  triangle. 

3.  The  ray  lies  in  the  same  plane  as  the  spatial  triangle  and  it  coincides  with  one 
edge  of  the  spatial  triangle. 
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In  the  first  case,  the  output  Z  value  is  computed  as 

Z  =  Xf  =  ^f  (B.15) 

where 

D  =  (Xy-X,)l(Z2-Z,)y-(Y,-Y,)f\-  (B.16) 

(K,  -  n)l(2,  -  Z,)x  -  (X,  -  X,)/)  + 

(Z,  -  Z3)((y,  -  y,)x  -  (Jf,  -  X3)si 


Dy  =  -X3[(y,  -  nKz,  -  Z,)  -  (n  -  n)(z.  -  Z3)l  (b.it) 

+y3((A-,  -  X,){Z,  -  Z,)  -  {X,  -  JC3)(Z,  -  Z3)| 

-Z3|(X,  -  X3)(y3  -  y3)  -  (X,  -  X,){Y,  -  y3)l 

where  the  primes  have  been  omitted.  If  we  use  the  above  equations  (B.9)  -  (B.ll) 
to  replace  the  variables  in  these  expressions,  we  find 


D  =  aZiZi  +  hZ\Zi  +  CZ2Z3  (B.18) 

where 

a  =  6,1(6*21/  -  6„2/)  -  6yi(6,2X  -  6,2/)  +  6*i(6y2X  -  6,21/)  (B.19) 

6  =  6*,i(6y3/  -  6*3^)  -  6yi(6x3/  -  6*3x)  +  6*i(6,3y  -  6y3z)  (B.20) 

c  =  -6,3(6,21/  -  6y2/)  +  6y3(6*2Z  -  6,2/)  -  6,3(6y2X  -  6,21/).  (B.21) 

Further  we  have 

Dx  =  aZ\Z2  +  bZiZz  +  cZ2Z'^  +  dZxZ^Z^  -I-  eZiZ\  +  fZ2Z\  (B.22) 

a  =  Uai-  Va2  +  ^^03  (B.23) 

6  =  C/61  -  Vb2  +  1^63  (B.24) 

c  =  C/ci  -  Vc2  +  V^C3  ^  (B.25) 

d  =  —6,301  +  6y3a2  —  6*303  (B.26) 

c  =  —6,361  +  6y362  —  6*363  (B.27) 

/  =  — 6,3Ci  +  6y3C3  —  6*3C3  (B.28) 


where 
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in  which  we  have  abbreviated 


=  f>zib^  —  bylbti 
Cl  =  6y2&z3  —  by3bt2 

0’2  =  bxlbz2  —  bx2bz\ 
hj  =  bxzbzx  —  6*ii,3 
Cj  =  bx2bjc3  ~  bx3bz2 

<^3  =  bx\by2  —  6*2  6yl 

63  =  6*36^1  —  6*1  6y3 

C3  =  bx2by3  —  by2bx3‘ 


(B.29) 

(B.30) 

(B.31) 

(B.32) 

(B.33) 

(B.34) 

(B.35) 

(B.36) 

(B.37) 


As  a  result  of  these  tedious  manipulations  we  are  now  able  to  express  an  inter¬ 
polated  value  of  ^  as  a  function  of  three  input  depth  values: 


Z  =  Z{ZuZ2,Z3)  =  f 


DxjZi,  Z2,  Z3) 

Z)(Zi,  Z2,  Z3) 


(B.38) 


If  the  variances  for  the  input  depth  values  are  Pi,P2>P3»  we  assume  that  they  are 
uncorrelated  and  the  above  functional  relationship  can  be  approximated  by  the  first 
terms  of  the  Taylor  series  we  can  use  the  variance  propagation  (B,4) 


,  ,  dZ  ,2  ,  dZ  .2 


(B.39) 


to  determine  the  output  variance  p.  It  remains  to  determine  the  partial  derivatives 


9Dx  r)  /-)  dP 

AT.  ^  ^XqT 


dZi  hZi  D 


The  partial  derivatives  of  D  are  obtained  easily  from  (B.18) 


(B.40) 


— —  z=  aZx  +  6Z3 

<jZi\ 

dD 

— —  =  aZi  +  CZ3 

O4U2 


(B.41) 

(B.42) 


hZ\  cZx 


(B.43) 
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and  similarly  those  for  Dx  from  (B.22) 

=  0,22  4*  62^  -|-  <£^3^3  4"  c2^  (B.44) 

=  a2i  4"  c^3  4"  d2\22  4"  (B.45) 

=  h2\  4"  c22  4"  d2\22  4"  2e2\23  4"  2f  2223.  (B.46) 

Note  that  a,  b  and  c  are  computed  differently  for  D  and  Dx.  We  must  also  take  care 

not  to  confuse  the  coefficient  /  used  in  the  expression  for  Dx  with  the  focal  length 
/.  This  result  enables  us  to  compute  the  variance  p  of  every  output  depth  value  2 
obtained  by  interpolation  in  the  case  whe’s  the  ray  has  exactly  one  point  in  common 
with  the  spatial  triangle. 

In  the  second  case  listed  above  we  consider  how  a  depth  value  is  obtained  if  the 
interpolation  ray  is  parallel  to  the  spatial  triangle  but  does  not  coincide  with  one  of 
its  edge  segments.  In  this  case  we  recall  that  2  is  obtained  by  interpolating  between 
the  two  endpoints  (Xi,Yi,  2i)  and  of  f^e  triangle  edge  segment: 


II 

(B.47) 

with 

d  =  x{Yx  -  Y2)  -  y{Xx  -  X2) 

(B.48) 

and 

dx  =  {Xx  -  X2)Y2  ~  (Yx  -  Y2)X2. 

(B.49) 

All  coordinate  components  are  after  the  motion  wsirping  although  the  primes  have 
been  omitted.  We  express  both  d  and  dx  in  terms  of  the  depth  values  of  the  endpoints 

2x  and  Z3  before  motion 

warping  as  done  before: 

d  —  ci2x  4"  6^2 

(B.50) 

where 

Q,  =  ^byx  ybxx 

(B.51) 

b  —  ybx2  ^by2 

(B.52) 

and  similarly 

dx  —  (i2\  4"  6^2  4"  c2x22 

(B.53) 

in  which 

\ 

a  =  byxU  -  bxiV 

(B.54) 

b  =  bx2V-by2U 

(B.55) 

c  =  bxxby2  —  byxbx2- 

(B.56) 

dPx 

d2x 

dDx 

d22 

dPx 

623 
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We  compute  the  partial  derivatives 


and 


which  we  need  for 


dd 

dZt 

dd 

dZi 


a 

b 


ddx 

dZi 

dZ2 


a  +  cZj 

b  +  cZi 


dz 

dZi 

dz 

dZ2 


d^ 


Bds. 


d-dxM: 


d* 


Then  the  variance  p  of  the  output  Z  becomes 


(B.57) 

(B.58) 

(B.59) 

(B.60) 

(B.61) 

(B.62) 

(B.63) 


In  the  third  and  last  case  the  interpolation  ray  coincides  with  at  least  one  edge 
segment  of  the  spatial  triangle.  Suppose  the  depth  values  of  the  endpoints  are  Zi, 
Z2  before  motion  warping.  After  motion  warping  they  become 


Z[  =  -W  +  b,xZi  and  Z'  =  -  W  +  6,2  Zj  (B.64) 

Recall  that  the  algorithm  assigned  the  smaller  of  the  two  values  Z{,  Z2  as  the  output 
value  Z.  Hence,  if  ZJ  <  ZJ  then  the  output  variance  would  be 

p  =  (B.65) 

otherwise 

P  =  (B-66) 

From  the  point  of  view  of  implementation  we  see  that  the  computation  of  vari¬ 
ances  cannot  simply  be  added  to  the  depth  computation  outlined  in  the  previous 
section.  A  completely  different  approach  is  necessary  in  order  to  have  the  6zi,6yj,6xt 
values  available  in  the  interpolation  stage.  Although  the  depth  computation  is  iden¬ 
tical  the  abstraction  from  the  underl)dng  geometry  makes  it  less  intuitive. 
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B.2.2  Variances  of  extrapolated  depth  values 

Some  of  the  output  depth  values  are  obtained  through  extrapolation  in  step  4  of 
the  prediction  algorithm.  As  described  above  we  use  a  very  simple  extrapolation 
procedure.  In  which  an  unassigned  depth  value  is  set  to  the  average  of  its  assigned 
neighbors.  This  extrapolation  w<is  motivated  by  assuming  smoothness  of  the  surface. 
Nevertheless,  the  choice  of  variance  at  this  location  should  indicate  the  fact,  that  no 
information  was  available  there  so  that  subsequent  updates  can  contribute  maximally 
to  the  value  at  this  point.  This  is  particularly  important  when  objects  enter  the  field 
of  view  that  exhibit  a  depth  discontiniiity  with  respect  to  the  previously  visible 
objects.  We  therefore  reset  the  variance  vadue  of  extrapolated  depth  map  entries  to 
their  initial  values  (see  chapter  6). 


An  Implementation  Example 

Appendix  C 


This  appendix  describes  the  implementation  of  the  temporal  surface  reconstruc¬ 
tion  <ilgorithm  for  the  structure  from  optical  flow  visual  mechanism  described  in 
chapter  8.  Although  the  update  stage  of  the  filter  is  completely  specified  by  the 
choices  of  state,  measurement  and  the  associated  covariances  in  that  chapter,  the 
execution  of  the  update  procedure  (4.16),  (4.17),  (4.18)  still  requires  some  straight¬ 
forward  manipulations  that  are  shown  here  for  one  example.  Anyone  trying  to  im¬ 
plement  a  temporal  surface  reconstruction  algorithm  will  find  the  details  presented 
here  useful. 

C.l  The  Update  Equations 


Figure  C.l:  The  structure  of  the  matrices  in  the  filter  update  stage 
We  begin  by  simplifying  the  covariance/certainty  update  (4.16) 

St  =  Si  +  C^Ri'C,  (C.l) 
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Rec&ll  that  the  matrix  C  is  a  2nm  x  nm  matrix  and  R  is  a  2nm  x  2nm  matrix  with 
the  sparse  banded  structure  shown  in  figure  C.l.  Because  of  the  sparse  nature  of  the 
matrix  R,  the  inverse  R~^  is  easy  to  obtain:  it  is  sparse  and  banded  with  the  same 
structure  as  R.  The  elements  are 

~  ^i+nm,i+nm/ Di  iot  0  ^  t  <  nm 

nm,«— nm/'Dt— nm  fo^  TITH  ^  ^  2nm 

^■^nm  =  -ri,i+nn.lDi  for  0  <  i  <  71171 

^"+nm.i  =  for  0  <  1  <  71771 

where  1)%  —  7*,'i7*,'^nm,t4’nm  7“,'^nfnjjrj^j4.nm'  Note  that  ~  ^t+nm,i' 

The  matrix  C^R~'C  is  a  nm  x  nm  diagonal  matrix,  so  that  only  the  diagonal 
elements  of  S  are  updated  in  (C.l): 


+  —  -  4"  C^^^^„^ri^nm,t+nm  2C{^rim,i^,i+nm^i,i+nm 


sr.  =  3ii  + 


(C.2) 


The  state  update  (4.17),  (4.18) 

+ (sn-‘crR,-'(y,  -  c,*,-)  (c.3) 

is  performed  through  the  iterative  Gauss-Seidel  solution  of  (5.7) 

p  =  (sit)'‘q  (C.4) 

where  p  =  x^  -  5Cfc  and  q  =  C^R^^y*  -  CfeX^).  To  compute  q,  we  begin  with 

V  =  Xfc  -  CfcXfc 

Vi  =  yi  —  CiiXi  for  0  <  i  <  nm 

Vi  =  Vi  -  Ci^i-nmXi-„m  (oT  nm  <  i  <  2nm 

This  vector  v  is  multiplied  from  the  right  with  CjRj^^,  a  matrix  that  we  have 
already  determined  above  in  the  computation  of  the  covariance  update.  The  resulting 
vector  q  is 

9»  ~  [(Ci+nm,«7*««  Cii'T’i.i+nm )(y«+nm  ^+nm.»3:i)  4"  (^’^y 

(^ii^i+nm.i+nm  ^»,i+nm^>+nm,«  )(y»  C,'»I,')] /(T’i^rj^nni.i+nm  ^t,«+nm^i+nm,» ) 

Having  obtained  q,  we  compute  p  by  applying  the  Gauss-Seidel  algorithm  (5.8)  to 
the  relationship  (C.4)  and  then  the  updated  state  vector  is  just 


Xfc  =  Xfc  4-  p. 


(C.6) 
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C.2  Hints  and  Hacks 

In  this  section,  four  optional  techniques  are  presented,  that  give  you  more  control 
over  the  reconstruction  process  or  improve  the  subjective  quality  of  the  results. 

1.  The  covariance  update  (C.2)  can  result  in  negative  values  in  the  diagonal  of  the 
certainty  matrix  S]^.  This  is  perfectly  normal  if  the  full  matrix  S  is  maintained. 
However,  we  approximate  the  certainty  matrix  by  a  sparse  banded  version  (6.8) 
in  which  only  the  diagonal  varies.  As  a  consequence,  the  diagonal  values  take 
the  role  of  inverse  variances  and  should  be  positive. 

A  positive  sign  of  the  update  certainty  values  can  easily  be  insured  by  first  com¬ 
puting  the  as  in  (C.2)  and  if  the  result  is  negative,  we  set  the  measurement 
covariances  rj,, Vi+nm.i  to  zero  and  recompute  both  s-  and  ft. 

2.  Instantaneous  structure  estimation  algorithms  (see  chapter  2)  allow  to  control 
the  influence  of  the  smoothness  term  on  the  reconstructed  surface  by  means  of 
a  parameter  A.  This  explicit  control  can  also  be  established  in  the  temporal  re¬ 
construction  algorithm.  We  begin  by  writing  the  separation  (6.8)  of  the  matrix 
Sk  into  the  diagonal  representation  of  uncertainty  and  the  representation  of 
surface  smoothness  S,  the  latter  term  now  weighted  with  A: 

Si  =  Sfc  -f  AS.  (C.7) 

The  Gauss- Seidel  iterations  for  the  update  now  have  the  form 

Sii  -i-  4A 

where  p,-'  is  the  sum  of  the  four  values  of  p  that  are  nearest  neighbors  to 
on  the  depth  map  grid  in  the  case  of  a  membrane  model  of  smoothness. 
This  explicit  control  of  the  resulting  surface  smoothness  is  useful  for  subjective 
evaluations.  Compare,  for  example,  the  result  of  the  pepsi  experiment  in  section 
8.3  in  which  explicit  smoothness  control  was  used  and  the  result  on  the  same 
sequence  in  section  10.3  without  smoothness  control. 

3.  The  depth  Z  is  always  positive  and  usually  also  bounded  from  above.  The 
subjective  quality  of  results  can  be  improved,  by  introducing  a  limit  check 
just  after  the  state  update  of  the  filter  which  ensures  that  depth  values  are 
positive  or  within  certain  bounds.  Two  strategies  are  possible:  One  is  simply 
to  truncate  depth  values  to  bring  them  into  the  desired  range.  An  alternative 
more  compatible  with  surface  smoothness  is  to  fill  in  depth  values  outside  the 
range  with  the  average  of  neighbors  that  are  within  the  range.  In  either  case, 
the  certainty  value  corresponding  to  a  modified  depth  value  must  be  reset  to  its 
initial  value  to  indicate  the  complete  uncertainty  about  the  actual  value  there. 
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4.  Very  few  images  allow  three-dimensional  structure  to  be  recovered  everywhere 
with  a  given  visual  mechanism.  The  reason  is  that  preconditions  for  the  visual 
mechanism  such  as  the  applicability  of  a  particular  reflectance  function  or  the 
brightness  constancy  assumption  are  not  met.  An  example  of  such  a  situation 
is  the  surface  of  the  soda  can  in  the  pepsi  experiment  which  has  large  regions  of 
uniform  brightness,  so  that  the  brightness  constancy  constraint  does  not  allow 
the  extraction  of  information.  In  all  these  cases,  it  is  impossible  to  recover 
structure  data  in  certain  image  regions. 

It  is,  however,  possible  to  subjectively  improve  the  structure  estimate  in  these 
regions.  The  key  idea  is  that  low  quality  depth  estimates  will  be  identified 
by  low  values  in  the  corresponding  certainty  values.  A  useful  strategy  is  to 
designate  all  structure  estimates  with  certainty  below  a  chosen  threshold  as 
’’low  quality”  and  to  fill  them  in  with  neighboring  depth  values  of  higher  quality. 
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